RunPod•5mo ago
ART01

Multi-node training with multiple pods in the same region

I am trying to run multi-node training across multiple pods. When I launch multiple pods in the same region, they share the same public IP and differ only in port. How should I specify the IP and port for multi-node training? Does Secure Cloud offer multi-node training?
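For context, here is a minimal sketch of where the IP and port usually enter a multi-node launch. It assumes a PyTorch-style `env://` rendezvous, which the thread never names, so treat the framework as an assumption. The helper `rendezvous_env`, the address `203.0.113.10`, and the port `29500` are hypothetical placeholders, not values RunPod assigns. Whether other pods can actually reach that endpoint is the open question of this thread.

```python
from typing import Dict


def rendezvous_env(master_addr: str, master_port: int, nnodes: int,
                   gpus_per_node: int, node_rank: int,
                   local_rank: int) -> Dict[str, str]:
    """Build the rendezvous variables for one worker process.

    Hypothetical helper: the variable names follow the PyTorch env://
    convention, which is an assumption about the framework in use.
    """
    if not 0 <= node_rank < nnodes:
        raise ValueError("node_rank must be in [0, nnodes)")
    if not 0 <= local_rank < gpus_per_node:
        raise ValueError("local_rank must be in [0, gpus_per_node)")
    return {
        # Every worker points at the same address and port: node 0's endpoint.
        "MASTER_ADDR": master_addr,
        "MASTER_PORT": str(master_port),
        # Total number of worker processes across all nodes.
        "WORLD_SIZE": str(nnodes * gpus_per_node),
        # Global rank: workers are numbered node by node.
        "RANK": str(node_rank * gpus_per_node + local_rank),
        "LOCAL_RANK": str(local_rank),
    }


if __name__ == "__main__":
    # Placeholder address and port; substitute whatever node 0 exposes.
    env = rendezvous_env("203.0.113.10", 29500, nnodes=2,
                         gpus_per_node=2, node_rank=1, local_rank=0)
    for key, value in env.items():
        print(f"{key}={value}")
```

With two pods of 2 GPUs each, the first worker on the second pod gets global rank 2 out of a world size of 4. The address and port only tell workers where to meet; the collective traffic that follows may need further connectivity between the pods.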
13 Replies
ashleyk
ashleyk•5mo ago
You shouldn't launch multiple pods for multiple GPU training. Launch a single pod and use the dropdown to select how many GPUs you want to attach to it.
ART01
ART01•5mo ago
Thank you for your quick reply! I am trying to test training on 32 GPUs, so I thought I should run 4 pods (each node would have 8 GPUs, which is the maximum number of GPUs available for a single pod). Can a single pod have 32 GPUs? If multi-node training on Secure Cloud is impossible, is there any other way to test multi-node training? I need to measure multi-node training speed before deciding on a long-term contract.
ashleyk
ashleyk•5mo ago
If you need something custom with 32 GPUs, you may want to chat with @JM about arranging something for you.
ART01
ART01•5mo ago
@ashleyk Got it, thank you 😄 @JM Hi, could you please confirm whether there is an option for testing multi-node training? It's for a network bandwidth test, so I need at least two pods in the same region. 2 GPUs per pod are enough to test multi-node training, and about 3 to 4 hours is enough for the test.
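A network bandwidth test like the one described can be approximated without any training framework. The following is a rough sketch using only Python's standard `socket` module. The function names are hypothetical, and it measures plain TCP throughput from the sender's side, which only loosely predicts the collective-communication speed a real training job would see.

```python
import socket
import threading
import time
from typing import Dict, Tuple


def receive_all(server_sock: socket.socket, result: Dict[str, int]) -> None:
    """Accept one connection and count bytes until the sender closes it."""
    conn, _ = server_sock.accept()
    total = 0
    with conn:
        while True:
            chunk = conn.recv(1 << 16)
            if not chunk:
                break
            total += len(chunk)
    result["bytes"] = total


def measure_throughput(host: str, port: int, total_bytes: int,
                       chunk_size: int = 1 << 16) -> float:
    """Send total_bytes to host:port and return bytes per second.

    Timing is taken on the sender, so socket buffering can inflate the
    figure for small transfers; use a large total_bytes for real tests.
    """
    payload = b"\0" * chunk_size
    sent = 0
    start = time.perf_counter()
    with socket.create_connection((host, port)) as sock:
        while sent < total_bytes:
            n = min(chunk_size, total_bytes - sent)
            sock.sendall(payload[:n])
            sent += n
    elapsed = max(time.perf_counter() - start, 1e-9)
    return sent / elapsed


def loopback_demo(total_bytes: int = 1 << 20) -> Tuple[int, float]:
    """Run receiver and sender on one machine to check the sketch works."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
    server.listen(1)
    port = server.getsockname()[1]
    result: Dict[str, int] = {}
    receiver = threading.Thread(target=receive_all, args=(server, result))
    receiver.start()
    rate = measure_throughput("127.0.0.1", port, total_bytes)
    receiver.join()
    server.close()
    return result["bytes"], rate


if __name__ == "__main__":
    received, rate = loopback_demo()
    print(f"received {received} bytes at {rate / 1e6:.1f} MB/s")
```

For a test between two pods, the receiver would run on one pod and `measure_throughput` on the other, pointed at whatever address and port the receiving pod exposes.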
ashleyk
ashleyk•5mo ago
Oh, that's different, I thought you wanted 32 GPUs. You can do this yourself without involving @JM. @JM can assist with custom things; you don't need to involve him in things you can do yourself.
ART01
ART01•5mo ago
Then how can I test multi-node training (not multi-GPU training) with RunPod? The problem is the same as in my first question.
ashleyk
ashleyk•5mo ago
You probably need to log a GitHub issue for whatever application you're using and ask there.
ART01
ART01•5mo ago
I've already run multi-node training on my own server without issues. My question is about the network settings of the pods. I'm wondering whether multiple pods launched from Secure Cloud can communicate with each other using the same port number. When I checked, they share the same public IP and cannot reach each other via their private IPs.
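The kind of check described here can be reproduced with a short script. This is a minimal sketch using Python's standard `socket` module; `can_connect` is a hypothetical helper, and the address and port in the usage example are placeholders rather than real pod endpoints.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # Placeholder endpoint; replace with the other pod's address and port.
    print(can_connect("203.0.113.10", 29500))
```

Running it from one pod against the other pod's private IP, and then against the shared public IP with the mapped port, shows which path is open. A `False` result only says that this one TCP port is unreachable, not why.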
ashleyk
ashleyk•5mo ago
No, Secure Cloud works the same way as Community Cloud, so it's best to ask the developer of the software you're using how to implement it, as I said.
ART01
ART01•5mo ago
Ah, I got it. Thank you for your reply!
flash-singh
flash-singh•5mo ago
Multi-node training is a gap we have; we plan to enable some type of internal networking early this year.
JM
JM•5mo ago
In the meantime, if you need multiple nodes as full servers for 1 month+ (8 for A100/H100 or Ada 6000/L40, and 10 for A6000/A5000/A4000), let us know! We can do bare-metal rentals as well @ART01
ART01
ART01•5mo ago
I'm hoping for at least 32 A100 GPUs for 1 month. Before deciding to rent, I want to test the efficiency of multi-node training on your servers. Could we arrange a brief rental period, perhaps a few hours, to make sure it meets my requirements?