In the EU-SE-1 data center, sometimes when the model has been loaded and training is about to start, the process hangs right after moving some of the parameters to the GPUs (indefinitely, as far as I can tell). Setting NCCL_P2P_DISABLE=1, which disables P2P transport between GPUs, avoids the hang; however, this in turn causes issues when tensor parallelism is enabled, as it creates data inconsistencies between GPUs. The hang occurs in EU-SE-1 only some of the time, so I'd imagine it affects only some of the nodes in that data center? I've seen this with multiple different containers and models, so I don't think it's an issue with my code, which runs fine in other regions/data centers.
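
For anyone trying to reproduce this, the NCCL_P2P_DISABLE=1 setting mentioned above can be applied per launch roughly as follows. This is only a minimal sketch: the helper name `nccl_workaround_env` and the script name `train.py` are hypothetical, and turning on `NCCL_DEBUG=INFO` is an added assumption, included only because NCCL's logs may help compare a node that hangs against one that does not.

```python
import os


def nccl_workaround_env(base_env=None, disable_p2p=True, debug=True):
    """Return a copy of the environment with NCCL settings applied.

    Hypothetical helper for illustration; it does not touch os.environ itself.
    """
    env = dict(os.environ if base_env is None else base_env)
    if disable_p2p:
        # Disables P2P transport between GPUs (the setting from the post).
        env["NCCL_P2P_DISABLE"] = "1"
    if debug:
        # Assumption: ask NCCL for informational logs to help diagnose the hang.
        env["NCCL_DEBUG"] = "INFO"
    return env


if __name__ == "__main__":
    env = nccl_workaround_env()
    # Show only the NCCL-related variables that would be passed to the job.
    print({k: v for k, v in env.items() if k.startswith("NCCL_")})
    # To launch a (hypothetical) training script with these settings:
    # import subprocess
    # subprocess.run(["python", "train.py"], env=env, check=True)
```

Building a separate environment dict keeps the setting scoped to one job, which makes it easier to run with and without P2P on the same node and compare the results.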