Non-1:1 port mappings for multinode training
Hi, I'm trying to run multinode distributed training across several machines, but it isn't working. I believe the cause is that the port mappings between the machines aren't 1:1: when I launch with torchrun (torch.distributed.run) and specify the master port, choosing the internal port means the other machine sends data to the wrong external port, while choosing the external port means my master node doesn't listen on the correct internal port.
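For context, here's a minimal sketch of the kind of launch I mean (the IPs, ports, and train.py are placeholders for my real setup):

```
# On the master node: bind the rendezvous to its *internal* port
# (suppose NAT forwards external port 45000 -> internal port 29500 here).
torchrun --nnodes=2 --node_rank=0 --nproc_per_node=1 \
  --master_addr=10.0.0.1 --master_port=29500 train.py

# On the worker node: it can only reach the master via the *external*
# address/port, so the two nodes disagree on what the master port is.
torchrun --nnodes=2 --node_rank=1 --nproc_per_node=1 \
  --master_addr=203.0.113.5 --master_port=45000 train.py
```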
Is this a problem other people have had, and is there a solution?
Thanks in advance!