Runpod · 3mo ago
Mil

Instant Cluster DDP config not working

I created an instant cluster with a couple nodes, but torch DDP isn't working - seems like nodes can't talk to each other. Documentation says that instant cluster pods have relevant env vars created by default, & ports open, which doesn't seem to be true. I checked via ssh sessions.

root@node-1:~# env
SHELL=/bin/bash
SSH_AUTH_SOCK=/tmp/ssh-XXXXWeSELc/agent.128
PWD=/root
LOGNAME=root
RUNPOD_CPU_COUNT=252
MOTD_SHOWN=pam
HOME=/root
LANG=C.UTF-8
LS_COLORS= (stripped because too long & irrelevant here)
RUNPOD_POD_ID=mmcaunrj2xcix6
SSH_CONNECTION=150.228.3.3 28789 172.18.0.2 22
RUNPOD_MEM_GB=1415
RUNPOD_PUBLIC_IP=185.216.20.89
RUNPOD_VOLUME_ID=qrw769kbpg
RUNPOD_GPU_COUNT=8
TERM=xterm-256color
RUNPOD_POD_HOSTNAME=mmcaunrj2xcix6-64410c11
USER=root
RUNPOD_DC_ID=CA-MTL-3
SHLVL=1
RUNPOD_GPU_NAME=NVIDIA+H100+PCIe
SSH_CLIENT=150.228.3.3 28789 22
RUNPOD_TCP_PORT_22=41987
RUNPOD_API_KEY=(stripped for security reasons)
PATH=/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
SSH_TTY=/dev/pts/0
_=/usr/bin/env
root@node-1:~# echo $PRIMARY_ADDR

root@node-1:~#
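
The same check can be scripted so it is easy to repeat on every node. This is only a rough sketch: `PRIMARY_ADDR` is the one variable named in this post, and the other names in `EXPECTED_VARS` (`PRIMARY_PORT`, `NODE_RANK`, `NUM_NODES`) are assumptions about what the cluster is supposed to export, so adjust the list to whatever the documentation actually promises.

```python
import os

# Only PRIMARY_ADDR is confirmed by the post above.
# The remaining names are ASSUMPTIONS; edit this list to match the docs.
EXPECTED_VARS = ["PRIMARY_ADDR", "PRIMARY_PORT", "NODE_RANK", "NUM_NODES"]


def missing_cluster_vars(env=None, expected=EXPECTED_VARS):
    """Return the expected variable names that are unset or empty in env."""
    env = os.environ if env is None else env
    return [name for name in expected if not env.get(name)]


if __name__ == "__main__":
    missing = missing_cluster_vars()
    if missing:
        print("Missing cluster variables:", ", ".join(missing))
    else:
        print("All expected cluster variables are set.")
```

On a node with an environment like the one dumped above, this would be expected to report every name in the list as missing, which matches the empty `echo $PRIMARY_ADDR`.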
4 Replies
Poddy · 3mo ago
@Mil
Escalated To Zendesk
The thread has been escalated to Zendesk!
Ticket ID: #21595
Mil (OP) · 3mo ago
What's the typical Instant Cluster config for torch DDP? Is there a specific template to select?
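
For context, a multi-node DDP launch usually comes down to passing the node count, node rank, and rendezvous address to `torchrun` on each node. The sketch below builds that command from environment variables. It is not a confirmed Runpod configuration: `RUNPOD_GPU_COUNT` appears in the env dump earlier in this thread and `PRIMARY_ADDR` is named there too, but `PRIMARY_PORT`, `NODE_RANK`, and `NUM_NODES` are assumed names, and `train.py` is a hypothetical training script.

```python
import os
import shlex


def build_torchrun_cmd(env, script="train.py"):
    """Build a torchrun argument list from cluster environment variables.

    PRIMARY_PORT, NODE_RANK and NUM_NODES are ASSUMED variable names;
    `script` is a hypothetical training entry point.
    """
    return [
        "torchrun",
        f"--nnodes={env['NUM_NODES']}",
        f"--nproc_per_node={env['RUNPOD_GPU_COUNT']}",
        f"--node_rank={env['NODE_RANK']}",
        f"--master_addr={env['PRIMARY_ADDR']}",
        f"--master_port={env['PRIMARY_PORT']}",
        script,
    ]


if __name__ == "__main__":
    try:
        print(shlex.join(build_torchrun_cmd(os.environ)))
    except KeyError as exc:
        # A missing variable here is the same symptom described in this thread.
        print(f"Cannot build command, missing variable: {exc}")
```

If any of these variables is unset, the script prints which one is missing instead of a command, so it also works as a quick check before launching.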