
Issue with P2P transport between GPUs in EU-SE-1

I've been training LLMs using DeepSpeed.

However, I've noticed that when the pod is created in the EU-SE-1 data center, the process sometimes hangs right after the model has been loaded and training is about to start, just as some of the parameters are being moved to the GPUs (indefinitely, as far as I can tell).
The only way I've found to prevent this so far is to set the env var NCCL_P2P_DISABLE=1, which disables P2P transport between GPUs; however, this in turn causes issues when tensor parallelism is enabled, as it creates data inconsistencies between GPUs.
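
To illustrate what I mean, here's a rough sketch of the kind of minimal check that reproduces the symptom for me (this is not my actual training code; it assumes PyTorch with the NCCL backend, launched on one pod with torchrun, and the env-var workaround commented out):

```python
# Minimal P2P/NCCL check, assuming: torchrun --nproc_per_node=<num_gpus> this_script.py
import os

# Workaround from above: uncomment to disable NCCL peer-to-peer transport.
# Must be set before the process group / CUDA contexts are created.
# os.environ["NCCL_P2P_DISABLE"] = "1"

import torch
import torch.distributed as dist


def main():
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(rank % torch.cuda.device_count())

    # A single all-reduce over a small tensor; on an affected node this is
    # roughly where things hang once data starts moving between GPUs.
    x = torch.ones(1024, device="cuda") * rank
    dist.all_reduce(x)
    if rank == 0:
        print("all_reduce completed:", x[0].item())

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

With NCCL_P2P_DISABLE=1 set, the same script completes, which is what makes me suspect the P2P path specifically.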

This doesn't always happen when the pod is in EU-SE-1, only sometimes, so I'd imagine it's an issue with only some of the nodes in that data center? I've seen this with multiple different containers and models, so I don't think it's an issue with my code, as it runs fine in other regions/data centers.