Created by echozhou on 12/31/2023 in #⛅|pods-clusters
How to use runpod for multi-machine distributed training?
We have requested symmetric ports for both machines and configured them so each node is reachable from the other via SSH, but distributed training still does not work.
The commands we are using:
On the first node (node_rank 0):
torchrun --nproc_per_node=1 \
--nnodes=2 \
--node_rank=0 \
--master_addr="216.249.100.66" \
--master_port=12619 \
test.py
On the second node (node_rank 1):
torchrun --nproc_per_node=1 \
--nnodes=2 \
--node_rank=1 \
--master_addr="216.249.100.66" \
--master_port=12619 \
test.py
We are using RunPod Pytorch 2.1.
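For reference, a minimal test.py for this kind of multi-node smoke test would look roughly like the sketch below. This is not our exact script; it assumes the NCCL backend and one GPU per node, and relies on the environment variables that torchrun sets automatically.

# Minimal multi-node connectivity check; torchrun sets RANK, LOCAL_RANK,
# WORLD_SIZE, MASTER_ADDR, and MASTER_PORT for each process.
import os
import torch
import torch.distributed as dist

def main():
    # Initialize the default process group (NCCL for GPU-to-GPU communication).
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    world_size = dist.get_world_size()
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

    # All-reduce a single tensor to confirm the two nodes can actually talk.
    t = torch.ones(1, device="cuda") * rank
    dist.all_reduce(t, op=dist.ReduceOp.SUM)
    print(f"rank {rank}/{world_size} all_reduce result: {t.item()}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()

If this hangs at init_process_group, the nodes cannot reach each other on the master address and port, which points to a networking or port-mapping issue rather than a training-code issue.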