Runpod, 2y ago
44 replies
wallscalr

[Urgent] failed : Software caused connection abort

Can someone help with this error, please? It's causing a huge problem for our next release.

We're trying to connect two different machines with PyTorch and Lightning over TCP ports. I have followed the directions RunPod gives for exposing these ports (>70000):

https://docs.runpod.io/pods/configuration/expose-ports

PyTorch and NCCL appear to start opening the connection just fine, and then we get an exception:

torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1691, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.19.3
ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.
Last error:
socketStartConnect: Connect to 192.168.240.2<39817> failed : Software caused connection abort

Can anyone give some insight into what may be happening here please?
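
One thing that might be worth ruling out: the address in the last error line, 192.168.240.2, is in a private range (192.168.0.0/16), so it may not be reachable from the other machine at all. Below is a minimal sketch of a check that could be run from the connecting machine, using only the Python standard library. The helper names `is_private_address` and `can_connect` are hypothetical, not part of PyTorch, Lightning, or NCCL, and a successful plain TCP connection would not by itself prove that NCCL will work.

```python
import ipaddress
import socket


def is_private_address(host: str) -> bool:
    """Return True if host is a literal IP address in a private range."""
    try:
        return ipaddress.ip_address(host).is_private
    except ValueError:
        # Not a literal IP address (probably a hostname), so we cannot tell.
        return False


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Attempt a plain TCP connection; return True only if it succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused, aborted, unreachable, and timed-out connections.
        return False


# Example usage with the endpoint from the error message above:
#   is_private_address("192.168.240.2")
#   can_connect("192.168.240.2", 39817)
```

If the address is private and the connection fails, that would suggest the peer is advertising an internal interface address rather than the publicly exposed one, but this is an assumption to verify, not a confirmed diagnosis.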