RunPod•3w ago

CUDA device uncorrectable ECC error

I'm using a 5xH100 pod and got an uncorrectable ECC error for devices 1, 2, and 3. Devices 0 and 4 can be used without a problem. It seems the devices or the system need a reboot. Any help on this? I've already submitted a ticket on the website with the pod id.

Python 3.12.5 | packaged by Anaconda, Inc. | (main, Sep 12 2024, 18:27:27) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.device_count()
5
>>> torch.tensor([1], device='cuda:0')
tensor([1], device='cuda:0')
>>> torch.tensor([1], device='cuda:1')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: CUDA error: uncorrectable ECC error encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
>>> torch.tensor([1], device='cuda:2')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: CUDA error: uncorrectable ECC error encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
>>> torch.tensor([1], device='cuda:3')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: CUDA error: uncorrectable ECC error encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
>>> torch.tensor([1], device='cuda:4')
tensor([1], device='cuda:4')
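Aside: the per-device probing above can be automated. A minimal sketch, assuming the same PyTorch environment as the session shown, that loops over every visible device and reports which ones fail:

import torch

# Probe each visible CUDA device with a trivial allocation and report
# which ones raise errors such as "uncorrectable ECC error encountered".
for i in range(torch.cuda.device_count()):
    try:
        torch.tensor([1], device=f"cuda:{i}")
        print(f"cuda:{i} OK ({torch.cuda.get_device_name(i)})")
    except RuntimeError as err:
        print(f"cuda:{i} FAILED: {err}")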
56 Replies
Dj
Dj•3w ago
Hey, can you share your pod id or dm me your account email?
Jason
Jason•3w ago
What's your container's CUDA version, and what CUDA version is your pod on? Run nvidia-smi and nvcc --version to check, i think
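A quick way to check this from inside the pod without leaving Python: a minimal sketch, assuming PyTorch is installed and nvidia-smi is on the PATH, that prints the CUDA runtime the container's PyTorch wheel was built against alongside the driver's report:

import subprocess
import torch

# CUDA runtime version the PyTorch wheel in the container was built against
print("torch CUDA runtime:", torch.version.cuda)

# Driver-side view; the "CUDA Version" in the nvidia-smi header is the highest
# runtime the installed host driver supports
print(subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout)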
bghira
bghira•3w ago
actually same error here occurring on H100 pods (secure cloud). @Dj pod id is hs6vtqnu343wwj
Jason
Jason•3w ago
@bghira just wondering
bghira
bghira•3w ago
cu124. it's H100s, it's not going to be 11.8. 🙂
Jason
Jason•3w ago
All 12.4? Try 12.8
bghira
bghira•3w ago
it's not the problem. also, 12.8 will need an entirely different version of PyTorch, which isn't something someone can just do in production
Jason
Jason•3w ago
Oh I see. What's the problem?
bghira
bghira•3w ago
see the original post in this thread 🫠
support is looking at the ticket with the pod ID. i'll update here if they give any more insight into the reason.
at my org we have something like 200-300 H100s and i've seen CUDA permissions errors or NVML init errors, but never this kind of ongoing ECC error. it's new to me, so i'm curious too
Dj
Dj•3w ago
I'm also taking a look, we had a small outage yesterday which may be related, but I'm working on going through the relevant logs (similar to what support would be doing)
bghira
bghira•3w ago
been there. i understand. thanks for taking a look
riverfog7
riverfog7•3w ago
can confirm. i'm dying waiting for cuda kernels to compile
Dj
Dj•3w ago
Are you on the same org or just also seeing the same error?
bghira
bghira•3w ago
different case
riverfog7
riverfog7•3w ago
my experience with H100s has been fine (cuz i don't use them a lot)
bghira
bghira•3w ago
this is a pretty rare issue for SXM5 systems but i'm on H100 PCIe which are a lot more "meh" from past experience (not a RunPod problem, it's an NVIDIA problem)
riverfog7
riverfog7•3w ago
that was about making vllm work with blackwell
Dj
Dj•3w ago
Just got freed from my meeting, hunting down the error now
riverfog7
riverfog7•3w ago
i think someone got nvlink errors on H100s in another thread
bghira
bghira•3w ago
that happens if the fabric manager crashes on the host system
riverfog7
riverfog7•3w ago
yeah mentioned this issue there
bghira
bghira•3w ago
must be the outage mentioned from yesterday.
you know, the other annoying thing is if the OS auto-updates the nvidia drivers :KEKLEO:
riverfog7
riverfog7•3w ago
lol
Jason
Jason•3w ago
What's a fabric manager?
Dj
Dj•3w ago
@yhlong00000 Can you check machine nang0aeab9ed for this ECC error on the host? Edit: fmylxpa4t5so as well?
riverfog7
riverfog7•3w ago
Jason
Jason•3w ago
Wow, that's complicated stuff
riverfog7
riverfog7•3w ago
google's TPUs are more insane 😆
bghira
bghira•3w ago
it's really frustrating too. TPUs are easy versus Cerebras crap
riverfog7
riverfog7•3w ago
wait.. what? i heard they made a wafer-level chip. maybe wafer-sized but idk
yhlong00000
yhlong00000•3w ago
I've unlisted the machine for now and will have infra team check further.
bghira
bghira•3w ago
well i went to the cu128 container as recommended and one of the GPUs still gives an ECC error, i think
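One way to confirm which GPU is still flagging ECC errors is to read the NVML counters directly. A minimal sketch, assuming the nvidia-ml-py package (imported as pynvml) is installed and NVML is reachable from inside the container:

import pynvml  # pip install nvidia-ml-py

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        try:
            # Uncorrected ECC errors accumulated since the last driver reload
            count = pynvml.nvmlDeviceGetTotalEccErrors(
                handle,
                pynvml.NVML_MEMORY_ERROR_TYPE_UNCORRECTED,
                pynvml.NVML_VOLATILE_ECC,
            )
            print(f"GPU {i}: {count} uncorrected ECC errors")
        except pynvml.NVMLError as err:
            print(f"GPU {i}: ECC query failed ({err})")
finally:
    pynvml.nvmlShutdown()

Running nvidia-smi -q -d ECC from a shell should report the same counters if NVML calls are restricted in the container.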
Dj
Dj•3w ago
Can you share the pod id? I'm still waiting on a followup from infra
Jason
Jason•3w ago
seems like it happens to a lot of machines
Dj
Dj•3w ago
It shouldn't 😧
Jason
Jason•3w ago
just curious, how many GPUs were you using, and which type?
Dj
Dj•3w ago
The GPUs on both machines (this thread and its sibling) have been reset, they should be good to go :)
riverfog7
riverfog7•3w ago
wow, super fast support. about a 1 business day turnaround 🙂 (obviously excluding saturday / sunday)
Jason
Jason•3w ago
Even faster if the DC engineer can stay up 24h hahah
riverfog7
riverfog7•3w ago
That's brutal lol
And cloudfront seems to have issues, so can you plz check that
Dj
Dj•3w ago
I don't think I can do anything about that 😭 What do you mean?
Jason
Jason•3w ago
Maybe the proxy, cloudflare
riverfog7
riverfog7•3w ago
Yeah 3 ppl had issues
riverfog7
riverfog7•3w ago
Http proxy being so slow. They said it got modem-era speeds, 100kbps
Dj
Dj•3w ago
Oh okay, Cloudfront is a different company than Cloudflare. I was very confused
riverfog7
riverfog7•3w ago
Wow. Why did i say cloudfront
Dj
Dj•3w ago
I have a meeting right now but I'll be back soon to look
Jason
Jason•3w ago
Hahah you use aws a lot?
riverfog7
riverfog7•3w ago
Yeah. It's 12:00 here, so probably lack-of-sleep issues
Jason
Jason•3w ago
Yup good night
bghira
bghira•2w ago
Based on the information you’ve provided, we don’t believe this is an issue with the RunPod platform. Unfortunately, without more specific GPU details, our reliability team is unable to investigate further or escalate the matter to our hardware vendor. We’ve also reviewed our monitoring metrics and didn’t observe any ECC errors in the past 30 days, as I mentioned earlier.
:maaaaan: i can't even believe the audacity of the support team to respond like this. do you guys want us wasting money running broken pods? really? blaming me, even when these threads from around the same time indicated this was a shared issue affecting more than one user :maaaaan:
Dj
Dj•2w ago
Can you share your pod ID? It is technically a fault of our platform that we distributed you onto this hardware while it is in this state.
It's very likely you were placed onto the physical server this thread was created for, and it's something I am personally actively working on with our hardware partner: they keep sending this server back online without doing the proper work. We delist it, they say "it's fixed :)" and send it back, and the same server has a problem reported. Naturally this is absolutely not the experience we want you to have.
I'm not normally weekend support, so I'm not home right now, but I'll find your ticket based on the response you got, look into this, and follow up with you later tonight. @bghira
bghira
bghira•2w ago
it's this one but it's gone now
i know how this feels. i have a free 8xH100 system that "fell out of a billing system" and its 100gbps port flaps, so it's like, cool, i guess, but also kinda useless. i told the vendor. they haven't done anything in ~2 weeks
Dj
Dj•2w ago
Logs are forever (14 days), I'll find it :)
Okay, yeah, with the context of knowing you were one of the first reports, this is absolutely a misfire from support. The machine in question is the problematic machine that keeps getting relisted by its host.
