H100 servers having issues?

Hey RunPod folks, is something going on with the H100 Secure Cloud machines? I first ran into a number of weird issues on an 8xH100 (SXM) server. It looks like cross-GPU links are going down randomly, but it's hard to say exactly what is going on: I get random timeouts in multi-GPU comms after days of work. I tried spinning up new machines (ID: nyotnwudbsq0mu, ID: 23xahufe1yk33g), but they are stuck loading the Docker images from our private Docker registry (which works great and which I can access from other RunPod machines). Can someone please have a look?
19 Replies
AstraliteHeart (OP), 7mo ago
I am also having issues with web SSH getting stuck on a loading/white page, even for machines I can SSH into normally. As a data point: I just spun up a Community Cloud machine (8xH100 SXM) and everything works great. Web SSH connects, it loaded the Docker image super fast, and so far there are no inter-GPU connectivity issues.
Poddy, 7mo ago
@AstraliteHeart The thread has been escalated to Zendesk!
riverfog7, 7mo ago
@AstraliteHeart There was a thread with a solution for poorly performing GPU interconnects.
riverfog7, 7mo ago
Unfortunately I forgot the name. It was a post about H100s, I think.
riverfog7, 7mo ago
It was related to something about NVSwitch.
AstraliteHeart (OP), 7mo ago
[YouTube link: OCT, "HALF HORSE HALF MAN | OFFICIAL VIDEO"]
AstraliteHeart (OP), 7mo ago
Was that about disabling P2P, by any chance? Here or on the website?
riverfog7, 7mo ago
I think it was something about NVLink.
riverfog7, 7mo ago
The fix was to run some commands that changed a few settings, related to environment variables, and that solved it.
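The thread never names the environment variables involved. As a minimal sketch only, assuming the fix resembled the P2P question above, the snippet below builds an environment with two documented NCCL variables: `NCCL_DEBUG=INFO` for verbose logs and `NCCL_P2P_DISABLE=1` to turn off direct GPU-to-GPU transport. The helper name `nccl_debug_env` is hypothetical.

```python
import os


def nccl_debug_env(base=None, disable_p2p=True):
    """Return a copy of the environment with NCCL diagnostics enabled.

    Hypothetical helper: the thread does not say which variables were changed.
    """
    env = dict(os.environ if base is None else base)
    # Documented NCCL variable: verbose logging of transport selection.
    env["NCCL_DEBUG"] = "INFO"
    if disable_p2p:
        # Documented NCCL variable: disables direct GPU-to-GPU (P2P) transport.
        env["NCCL_P2P_DISABLE"] = "1"
    return env


if __name__ == "__main__":
    env = nccl_debug_env()
    print({k: v for k, v in env.items() if k.startswith("NCCL_")})
```

The returned dict can be passed as `env=` to `subprocess.run` when launching a training job. Disabling P2P usually costs bandwidth, so it is better treated as a diagnostic step than as a permanent fix.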
riverfog7, 7mo ago
#CUDA device uncorrectable ECC error (probably related)
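If uncorrectable ECC errors are a suspect, one way to check from inside a pod is to query the per-GPU counters with `nvidia-smi`. The sketch below is an assumption about how one might do that, not something taken from the linked thread. The function names are hypothetical, and GPUs that report a non-numeric value such as `[N/A]` are mapped to `None`.

```python
import subprocess

# Query the uncorrected volatile ECC counter for each GPU as CSV without a header.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,ecc.errors.uncorrected.volatile.total",
    "--format=csv,noheader",
]


def parse_ecc_counts(csv_text):
    """Map GPU index -> uncorrected volatile ECC count (None if not numeric)."""
    counts = {}
    for line in csv_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        index, _, value = line.partition(",")
        value = value.strip()
        counts[int(index)] = int(value) if value.isdigit() else None
    return counts


def gpus_with_ecc_errors(csv_text):
    """Return the sorted indices of GPUs reporting a non-zero count."""
    return sorted(i for i, n in parse_ecc_counts(csv_text).items() if n)


if __name__ == "__main__":
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    print(gpus_with_ecc_errors(out))
```

Any GPU index returned here is worth including in a support ticket alongside the pod ID.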
Dj, 7mo ago
Hi, and sorry for the delay in responding here. The GPUs on this machine have been reset, so they should(?) be good to go again.
fluid, 7mo ago
I'm also getting just a white page on every GPU I'm trying. I don't know if this is an issue with the service?
Dj, 7mo ago
That sounds unrelated. Do you want to make another thread explaining the issue so I can help you out?
riverfog7, 7mo ago
Probably the same Cloudflare issue.
