GPU P2P transport issue in EU-SE-1
I've been training LLMs using DeepSpeed.
However, I've noticed that when the pod is created in the EU-SE-1 data center, sometimes, once the model has been loaded and training is about to start, the process hangs right after moving some of the parameters to the GPUs (indefinitely, as far as I can tell). The only way I've found to prevent this so far is to set the env var NCCL_P2P_DISABLE=1, which disables P2P transport between GPUs; however, this in turn causes issues when tensor parallelism is enabled, as it creates data inconsistencies between GPUs. This doesn't always happen when the pod is in EU-SE-1, only sometimes, so I'd imagine the issue is limited to some of the nodes in that data center? I've seen this with multiple different containers and models, so I don't think it's an issue with my code, as it runs fine in other regions/data centers.
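For anyone hitting the same thing, here's a minimal sketch of the workaround plus how I check which node/topology I landed on (the launch command at the end is a placeholder for whatever you actually run, not my real command):

```shell
# Workaround: disable NCCL's P2P (CUDA IPC / NVLink) transport for this job only
export NCCL_P2P_DISABLE=1

# Make NCCL log its transport selection, so you can confirm whether it
# actually fell back to SHM/NET instead of P2P
export NCCL_DEBUG=INFO

# Print the GPU interconnect topology of the node you landed on
# (guarded so it no-ops on machines without nvidia-smi)
command -v nvidia-smi >/dev/null && nvidia-smi topo -m

# Then launch as usual, e.g. (placeholder):
# deepspeed train.py --deepspeed_config ds_config.json
```

Comparing the `nvidia-smi topo -m` output between a hanging node and a healthy one might help narrow down whether it's specific hardware/links in EU-SE-1.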