Multinode training Runpod ports

I'm trying to train a distributed model across multiple nodes: 2 Pods with 8x 4090 GPUs each. We can't train using torchrun, because it needs the same TCP port to be reachable on every machine, and Runpod assigned me a random external port. Example commands:

NODE A:
torchrun --nnodes=2 --node_rank=0 --nproc_per_node=8 --master_addr="" --master_port=52616 scripts/ run --config_file "['configs/hyper_parameters.yaml','configs/network.yaml','configs/transforms_train.yaml','configs/transforms_validate.yaml','configs/transforms_infer.yaml']"

NODE B:
torchrun --nnodes=2 --node_rank=1 --nproc_per_node=8 --master_addr="" --master_port=52616 scripts/ run --config_file "['configs/hyper_parameters.yaml','configs/network.yaml','configs/transforms_train.yaml','configs/transforms_validate.yaml','configs/transforms_infer.yaml']"
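For readers following along: the two commands above differ only in --node_rank, and both nodes must be given the same --master_addr and --master_port. A minimal sketch of that constraint, assuming a hypothetical helper name (torchrun_command), a placeholder address (10.0.0.1), and a placeholder script name (train.py), none of which come from the original post:

```python
import shlex


def torchrun_command(node_rank, master_addr, master_port,
                     nnodes=2, nproc_per_node=8, script_args=()):
    """Build the torchrun argument list for one node (hypothetical helper).

    Only flags that appear in the post are used. Every node gets the same
    master_addr and master_port; only node_rank changes.
    """
    return [
        "torchrun",
        f"--nnodes={nnodes}",
        f"--node_rank={node_rank}",
        f"--nproc_per_node={nproc_per_node}",
        f"--master_addr={master_addr}",
        f"--master_port={master_port}",
        *script_args,
    ]


if __name__ == "__main__":
    # Placeholder address and script name, for illustration only.
    for rank in (0, 1):
        cmd = torchrun_command(rank, "10.0.0.1", 52616,
                               script_args=["train.py"])
        print(shlex.join(cmd))
```

If the port that node B can actually reach on node A is not the one passed as --master_port, the workers on node B cannot join the rendezvous, which is the situation described in this post.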
Madiator2011 5mo ago
The external port is always randomised and not symmetric.
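One way to see this from the worker side is to test whether the master's address and port accept a TCP connection at all. A minimal sketch using only the Python standard library; the function name port_is_reachable is hypothetical, and the check only succeeds while something is listening on the master (for example, after node 0 has been started):

```python
import socket


def port_is_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) can be opened.

    Hypothetical helper: it only tests basic TCP reachability and says
    nothing about whether the listener is actually a torchrun rendezvous.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution errors.
        return False
```

Run from node B against node A's address and the value passed as --master_port. If it returns False while node A is already waiting, the port is not reachable from outside as given, which matches the random, non-symmetric mapping described here.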
flash-singh 5mo ago
We plan to explore multi-node training in the future by enabling internal port communication.
_manuelcerezo 5mo ago
Thanks, this is very important for my team and our development. It is actually one of the most important ways companies train models.
gotcha 2mo ago
Hello, any updates on this? I'm trying to set up multinode training using DeepSpeed and found little to no information about this online. Thanks!
flash-singh 2mo ago
It's still in progress. Right now the estimate is sometime in May, and it will be a whole new feature, "Training Cluster".
