Non-1:1 port mappings for multinode training
Hi, I'm trying to run multinode distributed training across multiple machines, but it isn't working. I think the cause is that my machines don't have 1:1 port mappings: when I launch with torchrun (torch.distributed.run) and specify the master port, if I pass the internal port, the other machine sends data to the wrong external port; if I pass the external port, my master node doesn't listen on the correct internal port.
Is this a problem other people have had, and is there a solution?
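For reference, here's roughly what I'm running and, as far as I understand it, why the port matters. The address, port numbers, and node count below are placeholders, not my real setup:

```python
# Launched on each node with something like (placeholder addr/ports, 2 nodes):
#   torchrun --nnodes=2 --nproc_per_node=1 --node_rank=<0 or 1> \
#            --master_addr=203.0.113.10 --master_port=29500 train.py
import os
import torch.distributed as dist

def main():
    # torchrun sets MASTER_ADDR / MASTER_PORT / RANK / WORLD_SIZE for every
    # worker process. With the env:// rendezvous, rank 0 *binds* a store on
    # MASTER_PORT, and every other rank *connects* to MASTER_ADDR:MASTER_PORT.
    # That's where a non-1:1 mapping bites: rank 0 can only listen on the
    # internal port, while the other node can only reach the external one.
    dist.init_process_group(backend="gloo", init_method="env://")
    print(
        f"rank {dist.get_rank()}/{dist.get_world_size()} rendezvoused via "
        f"{os.environ['MASTER_ADDR']}:{os.environ['MASTER_PORT']}"
    )
    dist.barrier()  # hangs or times out if the port mapping is broken
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```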
Thanks in advance!
@Bob
Escalated To Zendesk
The thread has been escalated to Zendesk!
We are currently developing a cluster feature that will help with your case in the future. 🙏🏻
Sounds cool - what's the ETA, and is there any way to get involved early? 🙂
I don't know the ETA yet; we will announce it when it is ready~