
Issue with P2P transport between GPUs in EU-SE-1

I've been training LLMs using DeepSpeed.

However, I've noticed that when the pod is created in the EU-SE-1 data center, the process sometimes hangs right after the model has been loaded and training is about to start, just as some of the parameters are being moved to the GPUs (indefinitely, as far as I can tell).
The only way I've found to prevent this so far is to set the env var NCCL_P2P_DISABLE=1, which disables P2P transport between GPUs; however, this in turn causes issues when tensor parallelism is enabled, as it creates data inconsistencies between GPUs.
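
To illustrate what I mean, here's a rough sketch of the kind of minimal check that reproduces the symptom for me (this is not my actual training code; it assumes PyTorch with the NCCL backend, launched on one pod with torchrun, and the env-var workaround commented out):

```python
# Minimal P2P/NCCL check, assuming: torchrun --nproc_per_node=<num_gpus> this_script.py
import os

# Workaround from above: uncomment to disable NCCL peer-to-peer transport.
# Must be set before the process group / CUDA contexts are created.
# os.environ["NCCL_P2P_DISABLE"] = "1"

import torch
import torch.distributed as dist


def main():
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(rank % torch.cuda.device_count())

    # A single all-reduce over a small tensor; on an affected node this is
    # roughly where things hang once data starts moving between GPUs.
    x = torch.ones(1024, device="cuda") * rank
    dist.all_reduce(x)
    if rank == 0:
        print("all_reduce completed:", x[0].item())

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

With NCCL_P2P_DISABLE=1 set, the same script completes, which is what makes me suspect the P2P path specifically.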

This doesn't always happen when the pod is in EU-SE-1, only sometimes, so I'd imagine it's an issue with only some of the nodes in that data center? I've seen this with multiple different containers and models, so I don't think it's an issue with my code, as it runs fine in other regions/data centers.