CUDA device uncorrectable ECC error
I'm using a 5xH100 pod and I'm getting an uncorrectable ECC error on devices 1, 2, and 3. Devices 0 and 4 can be used without a problem. It seems the devices or the system need a reboot. Any help on this? I've already submitted a ticket on the website with the pod ID.
```
Python 3.12.5 | packaged by Anaconda, Inc. | (main, Sep 12 2024, 18:27:27) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.device_count()
5
>>> torch.tensor([1], device='cuda:0')
tensor([1], device='cuda:0')
>>> torch.tensor([1], device='cuda:1')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: CUDA error: uncorrectable ECC error encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
>>> torch.tensor([1], device='cuda:2')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: CUDA error: uncorrectable ECC error encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
>>> torch.tensor([1], device='cuda:3')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: CUDA error: uncorrectable ECC error encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
>>> torch.tensor([1], device='cuda:4')
tensor([1], device='cuda:4')
```
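For anyone landing here with the same symptom, a minimal sketch (plain PyTorch, nothing RunPod-specific) that probes every visible device with a tiny allocation, so you can list exactly which indices are in the bad ECC state before opening a ticket:

```python
import torch

# Probe each visible GPU with a tiny allocation. A device whose memory is in a
# bad ECC state raises a RuntimeError the first time a context is created on it.
bad = []
for i in range(torch.cuda.device_count()):
    name = torch.cuda.get_device_name(i)
    try:
        torch.tensor([1], device=f"cuda:{i}")
        print(f"cuda:{i} ({name}): OK")
    except RuntimeError as err:
        bad.append(i)
        print(f"cuda:{i} ({name}): {err}")

print("devices reporting errors:", bad)
```

Running it with `CUDA_LAUNCH_BLOCKING=1` set, as the error message itself suggests, keeps error reporting synchronous so the failing device index is unambiguous.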
56 Replies
Hey, can you share your pod id or dm me your account email?
Your container cuda version, and what cuda version is your pod?
Run nvidia-smi and nvcc --version to check, I think
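In case it helps, a quick sketch for grabbing both in one go from inside the pod (assumes PyTorch is installed in the container; the nvidia-smi call just shells out to the driver tooling):

```python
import subprocess

import torch

# CUDA runtime version the container's PyTorch build expects...
print("torch:", torch.__version__)
print("torch built with CUDA:", torch.version.cuda)

# ...versus the host driver version reported through nvidia-smi.
print(subprocess.run(
    ["nvidia-smi", "--query-gpu=index,name,driver_version", "--format=csv,noheader"],
    capture_output=True, text=True,
).stdout.strip())
```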
actually the same error is occurring here on H100 pods (Secure Cloud)
@Dj pod id is hs6vtqnu343wwj
@bghira just wondering
Cu124
it's H100s. it's not going to be 11.8. 🙂
All 12.4?
Try 12.8
it's not the problem..
also, 12.8 will need an entirely different version of pytorch. that isn't something someone can do in production
Oh I see, what's the problem?
see the original post in this thread
support is looking at the ticket with the pod ID. i'll update here if they give any more insight into the reason
at my org we have something like 200-300 H100s and i've seen CUDA permissions error or NVML init error, but never this kinda ongoing ECC error, it's new to me, so i'm curious too
I'm also taking a look, we had a small outage yesterday which may be related, but I'm working on going through the relevant logs (similar to what support would be doing)
been there. i understand
thanks for taking a look
can confirm im dying waiting for cuda kernels to compile
Are you on the same org or just also seeing the same error?
different case
my experience with H100s was fine (cuz i don't use them a lot)
this is a pretty rare issue for SXM5 systems but i'm on H100 PCIe which are a lot more "meh" from past experience (not a RunPod problem, it's an NVIDIA problem)
that was about making vllm to work with blackwell
Just got freed from my meeting, hunting down the error now
i think someone got nvlink errors on H100s
in other thread
that happens if the host system has the fabric manager crash
yeah
mentioned this issue there
must be the outage mentioned from yesterday
you know the other annoying thing is if the OS auto-updates the nvidia drivers :KEKLEO:
lol
What's a fabric manager
@yhlong00000 Can you check machine
nang0aeab9ed
for this ECC error on the host?
Edit: fmylxpa4t5so as well?
the thing that manages comm between the GPUs, I guess?
https://docs.nvidia.com/datacenter/tesla/fabric-manager-user-guide/index.html
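Rough sketch of what checking it looks like, assuming the standard nvidia-fabricmanager systemd unit from that guide (host-side only, not something you can run from inside a rented pod):

```python
import subprocess

# On NVSwitch-based hosts the fabric manager runs as a systemd service; if it
# crashes, NVLink / peer-to-peer traffic between the GPUs stops working.
result = subprocess.run(
    ["systemctl", "is-active", "nvidia-fabricmanager"],
    capture_output=True, text=True,
)
print("nvidia-fabricmanager:", result.stdout.strip() or result.stderr.strip())
```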
Wow, that's complicated stuff
google's TPUs are more insane 😆
it's really frustrating too
TPUs are easy versus Cerebras crap
wait.. what?
i heard they made a wafer level chip
maybe wafer sized but idk
I've unlisted the machine for now and will have infra team check further.
well i went to the cu128 container as recommended and one of the GPUs still gives an ECC error, i think
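If you want to confirm which index is actually flagging ECC errors, rather than guessing from the PyTorch traceback, here's a rough sketch using pynvml (the nvidia-ml-py package), assuming NVML is reachable from inside the container:

```python
import pynvml  # ships as the nvidia-ml-py package

pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    name = pynvml.nvmlDeviceGetName(handle)
    # Volatile = since the last reset/reboot; aggregate = lifetime of the board.
    for label, counter in (("volatile", pynvml.NVML_VOLATILE_ECC),
                           ("aggregate", pynvml.NVML_AGGREGATE_ECC)):
        try:
            errs = pynvml.nvmlDeviceGetTotalEccErrors(
                handle, pynvml.NVML_MEMORY_ERROR_TYPE_UNCORRECTED, counter)
            print(f"GPU {i} ({name}): uncorrected ECC ({label}) = {errs}")
        except pynvml.NVMLError as err:
            print(f"GPU {i} ({name}): ECC query failed: {err}")
pynvml.nvmlShutdown()
```

Volatile counters clear when the GPU is reset or the host reboots; aggregate counters persist in the board's InfoROM.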
Can you share the pod id? I'm still waiting on a followup from infra
seems like it happens to a lot of machines
It shouldn't 😧
just curious, how many GPUs were you using, and which type?
The GPUs on both machines (this thread and its sibling) have been reset, they should be good to go :)
wow super fast support
1 business day delay time 🙂 (obviously excluding saturday / sunday)
Even faster if the DC engineer can stay up 24h hahah
Thats brutal lol
And cloudfront seems to have issues, so can you plz check that?
I don't think I can do anything about that 😠 What do you mean?
Maybe the proxy, cloudflare
Yeah
3 ppl had issues
Perhaps you're referring to this? https://discord.com/channels/912829806415085598/1360254321614127145
Http proxy being so slow
They said it got modem-era speeds, around 100 kbps
Oh okay, CloudFront is from a different company than Cloudflare, I was very confused
Wow
Why did i say cloudfront
I have a meeting right now but I'll be back soon to look
Hahah you use AWS a lot?
Yeah
It's 12:00 here, so probably lack-of-sleep issues
Yup good night
> Based on the information you've provided, we don't believe this is an issue with the RunPod platform. Unfortunately, without more specific GPU details, our reliability team is unable to investigate further or escalate the matter to our hardware vendor. We've also reviewed our monitoring metrics and didn't observe any ECC errors in the past 30 days, as I mentioned earlier.

:maaaaan: i can't even believe the audacity of the support team to respond like this. do you guys want us wasting money running broken pods? really? blaming me, even when these threads from around the same time were showing it was a shared issue hitting more than one user :maaaaan:
Can you share your pod ID? It is technically a fault of our platform that we distributed you onto this hardware while it is in this state.
It's very likely you were placed onto the physical server this thread was created for and it's something I am personally actively working on with our hardware partner, they're sending this server back online without doing the proper work. We delist it, they say "it's fixed :)" and send it back, and the same server has a problem reported. Naturally this is absolutely not the experience we want you to have.
I'm not normally weekend support, so I'm not home right now, but I'll find your ticket based on the response you got, look into this, and follow up with you later tonight.
@bghira
it's this one but it's gone now
i know how this feels. i have a free 8xH100 system that "fell out of a billing system" and its 100gbps port flaps so it's like, cool, i guess, but also kinda useless. i told the vendor. they haven't done anything in ~2 weeks
Logs are forever (14 days), I'll find it :)
Okay, yeah with the context of knowing you were one of the first reports this is absolutely a misfire from support. The machine in question is the problematic machine that keeps getting relisted by its host.