Non-1:1 port mappings for multinode training
Hi, I'm trying to run multinode distributed training across multiple machines, but it isn't working. I think the cause is that my machines don't have 1:1 port mappings: when I launch with torchrun (torch.distributed.run) and specify the master port, if I pass the internal port, the other machine sends data to the wrong external port; if I pass the external port, my master node doesn't listen on the correct internal port.
Is this a problem other people have had, and is there a solution?
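For reference, here's roughly what I'm running and, as far as I understand it, why the port matters. The address, port numbers, and node count below are placeholders, not my real setup:

```python
# Launched on each node with something like (placeholder addr/ports, 2 nodes):
#   torchrun --nnodes=2 --nproc_per_node=1 --node_rank=<0 or 1> \
#            --master_addr=203.0.113.10 --master_port=29500 train.py
import os
import torch.distributed as dist

def main():
    # torchrun sets MASTER_ADDR / MASTER_PORT / RANK / WORLD_SIZE for every
    # worker process. With the env:// rendezvous, rank 0 *binds* a store on
    # MASTER_PORT, and every other rank *connects* to MASTER_ADDR:MASTER_PORT.
    # That's where a non-1:1 mapping bites: rank 0 can only listen on the
    # internal port, while the other node can only reach the external one.
    dist.init_process_group(backend="gloo", init_method="env://")
    print(
        f"rank {dist.get_rank()}/{dist.get_world_size()} rendezvoused via "
        f"{os.environ['MASTER_ADDR']}:{os.environ['MASTER_PORT']}"
    )
    dist.barrier()  # hangs or times out if the port mapping is broken
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```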
Thanks in advance!
@Bob
Escalated To Zendesk
The thread has been escalated to Zendesk!
We are currently developing a cluster feature that will help with your case in the future. 🙏🏻
Sounds cool - what's the ETA, and is there any way to get involved early? 🙂
I don't know the ETA yet; we will announce it when it is ready~