Instant Cluster DDP config not working
I created an Instant Cluster with a couple of nodes, but torch DDP isn't working; it looks like the nodes can't talk to each other. The documentation says that Instant Cluster pods have the relevant env vars created by default and the ports open, but that doesn't seem to be true. I checked via SSH sessions:
root@node-1:~# env
SHELL=/bin/bash
SSH_AUTH_SOCK=/tmp/ssh-XXXXWeSELc/agent.128
PWD=/root
LOGNAME=root
RUNPOD_CPU_COUNT=252
MOTD_SHOWN=pam
HOME=/root
LANG=C.UTF-8
LS_COLORS= (stripped because too long & irrelevant here)
RUNPOD_POD_ID=mmcaunrj2xcix6
SSH_CONNECTION=150.228.3.3 28789 172.18.0.2 22
RUNPOD_MEM_GB=1415
RUNPOD_PUBLIC_IP=185.216.20.89
RUNPOD_VOLUME_ID=qrw769kbpg
RUNPOD_GPU_COUNT=8
TERM=xterm-256color
RUNPOD_POD_HOSTNAME=mmcaunrj2xcix6-64410c11
USER=root
RUNPOD_DC_ID=CA-MTL-3
SHLVL=1
RUNPOD_GPU_NAME=NVIDIA+H100+PCIe
SSH_CLIENT=150.228.3.3 28789 22
RUNPOD_TCP_PORT_22=41987
RUNPOD_API_KEY=(stripped for security reasons)
PATH=/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
SSH_TTY=/dev/pts/0
_=/usr/bin/env
root@node-1:~# echo $PRIMARY_ADDR
root@node-1:~#
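
For anyone repeating this check on each node, here is a minimal Python sketch that reports which rendezvous variables are unset or empty. PRIMARY_ADDR is the only name taken from the session above. MASTER_ADDR, MASTER_PORT, WORLD_SIZE and RANK are the variables that torch.distributed's env:// initialization reads. Whether the cluster is supposed to set all of them is an assumption, and the helper name missing_vars is made up for illustration.

```python
import os

# PRIMARY_ADDR is the variable checked in the session above.
# MASTER_ADDR, MASTER_PORT, WORLD_SIZE and RANK are read by
# torch.distributed's env:// initialization. Treating all of them as
# "expected" on an Instant Cluster pod is an assumption.
EXPECTED_VARS = ["PRIMARY_ADDR", "MASTER_ADDR", "MASTER_PORT", "WORLD_SIZE", "RANK"]


def missing_vars(env, names=EXPECTED_VARS):
    """Return the names that are unset or empty in the given mapping."""
    return [name for name in names if not env.get(name)]


if __name__ == "__main__":
    missing = missing_vars(os.environ)
    if missing:
        print("unset:", ", ".join(missing))
    else:
        print("all expected variables are set")
```

When a job is started through torchrun, RANK and WORLD_SIZE are set by the launcher for each worker, so seeing those two unset in a plain SSH shell is normal. An empty PRIMARY_ADDR, as in the output above, is the part that stands out.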

4 Replies
@Mil
Escalated To Zendesk
The thread has been escalated to Zendesk!
Ticket ID: #21595
What's the typical Instant Cluster config for torch DDP? Is there a specific template to select?
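
One possible manual setup, sketched under assumptions rather than taken from any template: if the cluster variables are missing, the torchrun arguments can be passed explicitly on every node. The helper below only builds the command line. The address 10.0.0.1, port 29500 and script name train.py are placeholders, and the function name torchrun_cmd is hypothetical. The value 8 for processes per node matches RUNPOD_GPU_COUNT in the output above. The address must be one that the other nodes can actually reach, which is the open question in this thread.

```python
import shlex


def torchrun_cmd(master_addr, master_port, nnodes, node_rank, nproc_per_node, script):
    """Build a torchrun command line for one node of a static multi-node job."""
    return [
        "torchrun",
        f"--nnodes={nnodes}",                  # total number of nodes
        f"--node_rank={node_rank}",            # 0 on the primary node, 1..N-1 elsewhere
        f"--nproc_per_node={nproc_per_node}",  # usually one process per GPU
        f"--master_addr={master_addr}",        # address of node 0 (placeholder here)
        f"--master_port={master_port}",        # free TCP port on node 0 (placeholder)
        script,
    ]


if __name__ == "__main__":
    # Placeholder values: 2 nodes, 8 GPUs each, this node is rank 0.
    cmd = torchrun_cmd("10.0.0.1", 29500, 2, 0, 8, "train.py")
    print(shlex.join(cmd))
```

The same command is run on every node with only node_rank changed. If the workers hang at startup, the chosen port is probably not reachable between the nodes.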