Runpod · 15mo ago
Bob

Not 1:1 port mappings for multinode training

Hi, I'm trying to run multinode distributed training across multiple machines, but it isn't working. I think this is because the port mappings aren't 1:1: when I run torchrun (torch.distributed) and specify a port, if I choose the internal port, the other machine sends data to the wrong external port, and if I choose the external port, my master node doesn't listen on the correct internal port. Have other people run into this, and is there a solution? Thanks in advance!
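For reference, here's roughly the shape of the problem (a minimal sketch; the port numbers, MASTER_PUBLIC_IP, EXTERNAL_PORT, and train.py are placeholders, and the asymmetric-endpoint idea is only something I'm guessing might work, not a confirmed fix):
```bash
# Node 0 (master): torchrun hosts the c10d rendezvous store on the port given
# in --rdzv_endpoint, so it has to bind the *internal* port (e.g. 29500).
torchrun \
  --nnodes=2 \
  --nproc_per_node=1 \
  --rdzv_backend=c10d \
  --rdzv_id=job1 \
  --rdzv_endpoint=localhost:29500 \
  train.py

# Node 1 (worker): with a non-1:1 mapping, the worker has to dial the
# *external* port that RunPod maps onto 29500, not 29500 itself.
torchrun \
  --nnodes=2 \
  --nproc_per_node=1 \
  --rdzv_backend=c10d \
  --rdzv_id=job1 \
  --rdzv_endpoint=MASTER_PUBLIC_IP:EXTERNAL_PORT \
  train.py
```
Giving each side a different --rdzv_endpoint like this might get the rendezvous handshake through the NAT, but as far as I understand, NCCL then opens its own peer-to-peer connections on further ports, so a non-1:1 mapping could still break the actual training traffic.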
5 Replies
Poddy · 15mo ago
@Bob
Escalated To Zendesk
The thread has been escalated to Zendesk!
yhlong00000 · 15mo ago
We're currently developing a cluster feature that should help with your case in the future. 🙏🏻
Bob (OP) · 15mo ago
Sounds cool - what's the ETA, and is there any way to get involved early? 🙂
yhlong00000 · 15mo ago
I don't know the ETA yet; we'll announce it when it's ready~
