Non-1:1 port mappings for multinode training
Hi, I'm trying to run multinode distributed training across several machines, but it isn't working. I believe the cause is that the port mappings between the machines aren't 1:1: when I launch with torchrun (torch.distributed.run) and specify the master port, choosing the internal port means the other machine sends data to the wrong external port, while choosing the external port means my master node doesn't listen on the correct internal port.
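For context, here's a minimal sketch of the kind of launch I mean (the IPs, ports, and train.py are placeholders for my real setup):

```
# On the master node: bind the rendezvous to its *internal* port
# (suppose NAT forwards external port 45000 -> internal port 29500 here).
torchrun --nnodes=2 --node_rank=0 --nproc_per_node=1 \
  --master_addr=10.0.0.1 --master_port=29500 train.py

# On the worker node: it can only reach the master via the *external*
# address/port, so the two nodes disagree on what the master port is.
torchrun --nnodes=2 --node_rank=1 --nproc_per_node=1 \
  --master_addr=203.0.113.5 --master_port=45000 train.py
```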
Is this a problem other people have had, and is there a solution?
Thanks in advance!