Training error about NCCL


I just tried to retrain the network. However, when I use just one GPU, I get a "CUDA out of memory" error.

So I set the number of GPUs to 4, and then I got this error:

RuntimeError: Distributed package doesn't have NCCL built in

Could anybody tell me how to solve this?

Thank you in advance.

Hello @NayeeC, this is very strange. NCCL should be included with PyTorch. I would check on the PyTorch forums. If you have other fastMRI-specific questions, we can follow up on the GitHub forum.
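Before posting on the PyTorch forums, it may help to confirm what the installed PyTorch build actually supports. The sketch below is a hypothetical diagnostic (the helper name `pick_backend` is illustrative, not part of fastMRI or PyTorch). It uses the real `torch.distributed` checks `is_available()`, `is_nccl_available()` and `is_gloo_available()`, and assumes that falling back to the Gloo backend is acceptable when NCCL is missing. PyTorch builds for Windows and macOS typically do not ship NCCL, which is a common cause of this error.

```python
import torch
import torch.distributed as dist


def pick_backend() -> str:
    """Hypothetical helper: return a torch.distributed backend this build supports.

    Assumption: prefer NCCL when it is compiled in and CUDA GPUs are visible,
    otherwise fall back to Gloo (slower for GPU training, but widely available).
    """
    if not dist.is_available():
        raise RuntimeError("this PyTorch build has no torch.distributed support")
    if dist.is_nccl_available() and torch.cuda.is_available():
        return "nccl"
    if dist.is_gloo_available():
        return "gloo"
    raise RuntimeError("neither NCCL nor Gloo is available in this PyTorch build")


if __name__ == "__main__":
    # Print what the installed build reports, so it can be quoted in a forum post.
    print("torch version:", torch.__version__)
    print("CUDA devices visible:", torch.cuda.device_count())
    print("torch.distributed available:", dist.is_available())
    if dist.is_available():
        # The NCCL check only exists once torch.distributed is usable.
        print("NCCL built in:", dist.is_nccl_available())
        print("suggested backend:", pick_backend())
```

If "NCCL built in" prints `False`, the usual options are installing a PyTorch build that includes NCCL (e.g. a Linux CUDA build) or passing the returned name as `backend=` to `torch.distributed.init_process_group`. How fastMRI's training script selects its backend is not covered here, so check its own configuration before changing anything.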