Runpod•6mo ago

CUDA device uncorrectable ECC error

I'm using a 5xH100 pod and got an uncorrectable ECC error for devices 1, 2, and 3. Devices 0 and 4 can be used without a problem. It seems the devices or the system need a reboot. Any help on this? I've already submitted a ticket on the website with the pod id.

Python 3.12.5 | packaged by Anaconda, Inc. | (main, Sep 12 2024, 18:27:27) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.device_count()
5
>>> torch.tensor([1], device='cuda:0')
tensor([1], device='cuda:0')
>>> torch.tensor([1], device='cuda:1')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: CUDA error: uncorrectable ECC error encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
>>> torch.tensor([1], device='cuda:2')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: CUDA error: uncorrectable ECC error encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
>>> torch.tensor([1], device='cuda:3')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: CUDA error: uncorrectable ECC error encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
>>> torch.tensor([1], device='cuda:4')
tensor([1], device='cuda:4')
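For reference, the ECC state can also be read straight from the driver instead of provoking CUDA errors. A minimal sketch (assuming the nvidia-ml-py / pynvml package is available in the pod; nothing here is specific to this pod) that prints each GPU's aggregate uncorrectable ECC count:

```python
# Sketch: query per-GPU uncorrectable ECC counters via NVML (pynvml assumed installed).
import pynvml

pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    name = pynvml.nvmlDeviceGetName(handle)
    try:
        # Aggregate (lifetime) count of uncorrected / double-bit ECC errors.
        errs = pynvml.nvmlDeviceGetTotalEccErrors(
            handle,
            pynvml.NVML_MEMORY_ERROR_TYPE_UNCORRECTED,
            pynvml.NVML_AGGREGATE_ECC,
        )
        print(f"GPU {i} ({name}): uncorrectable ECC errors = {errs}")
    except pynvml.NVMLError as e:
        # ECC may be disabled or unsupported on some boards.
        print(f"GPU {i} ({name}): ECC query failed: {e}")
pynvml.nvmlShutdown()
```

A non-zero count on devices 1-3 (or the same information from `nvidia-smi -q -d ECC`) is the kind of detail support usually asks for.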
56 Replies
Dj
Dj•6mo ago
Hey, can you share your pod id or dm me your account email?
Unknown User
Unknown User•6mo ago
Message Not Public
bghira
bghira•6mo ago
actually same error occurring here on H100 pods (secure cloud). @Dj pod id is hs6vtqnu343wwj
Unknown User
Unknown User•6mo ago
Message Not Public
bghira
bghira•6mo ago
cu124. they're H100s, it's not going to be 11.8. 🙂
Unknown User
Unknown User•6mo ago
Message Not Public
bghira
bghira•6mo ago
it's not the problem. also, 12.8 would need an entirely different version of pytorch, which isn't something you can just swap out in production
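for context, a quick sketch (nothing pod-specific, just stock torch calls) to confirm which CUDA toolkit a given pytorch wheel was built against versus what the pod actually exposes:

```python
# Sketch: check the CUDA toolkit the installed PyTorch wheel targets vs. what is visible at runtime.
import torch

print("torch:", torch.__version__)            # e.g. 2.x.y+cu124
print("built for CUDA:", torch.version.cuda)  # toolkit version the wheel was compiled against
print("cuDNN:", torch.backends.cudnn.version())
if torch.cuda.is_available():
    print("visible GPUs:", torch.cuda.device_count())
    print("device 0:", torch.cuda.get_device_name(0), torch.cuda.get_device_capability(0))
```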
Unknown User
Unknown User•6mo ago
Message Not Public
bghira
bghira•6mo ago
see the original post in this thread 🫠 support is looking at the ticket with the pod ID. i'll update here if they give any more insight into the cause. at my org we have something like 200-300 H100s and i've seen CUDA permission errors or NVML init errors, but never this kind of ongoing ECC error. it's new to me, so i'm curious too
Dj
Dj•6mo ago
I'm also taking a look, we had a small outage yesterday which may be related, but I'm working on going through the relevant logs (similar to what support would be doing)
bghira
bghira•6mo ago
been there, i understand. thanks for taking a look
riverfog7
riverfog7•6mo ago
can confirm. i'm dying waiting for cuda kernels to compile
Dj
Dj•6mo ago
Are you on the same org or just also seeing the same error?
bghira
bghira•6mo ago
different case
riverfog7
riverfog7•6mo ago
my experience with H100s has been fine (cuz i don't use them a lot)
bghira
bghira•6mo ago
this is a pretty rare issue for SXM5 systems, but i'm on H100 PCIe, which is a lot more "meh" in my past experience (not a RunPod problem, it's an NVIDIA problem)
riverfog7
riverfog7•6mo ago
that was about making vllm work with blackwell
Dj
Dj•6mo ago
Just got freed from my meeting, hunting down the error now
riverfog7
riverfog7•6mo ago
i think someone got nvlink errors on H100s in another thread
bghira
bghira•6mo ago
that happens if the fabric manager crashes on the host system
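if you want to sanity-check NVLink from inside a pod, here's a rough sketch via NVML (assumes nvidia-ml-py / pynvml is installed; PCIe cards without NVLink will just report no active links):

```python
# Sketch: list active NVLink links per GPU via NVML (pynvml assumed installed).
import pynvml

pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    active = []
    for link in range(pynvml.NVML_NVLINK_MAX_LINKS):
        try:
            if pynvml.nvmlDeviceGetNvLinkState(handle, link) == pynvml.NVML_FEATURE_ENABLED:
                active.append(link)
        except pynvml.NVMLError:
            break  # this device exposes fewer links (or none)
    print(f"GPU {i}: active NVLink links = {active if active else 'none'}")
pynvml.nvmlShutdown()
```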
Unknown User
Unknown User•6mo ago
Message Not Public
riverfog7
riverfog7•6mo ago
yeah mentioned this issue there
bghira
bghira•6mo ago
must be the outage mentioned from yesterday. you know, the other annoying thing is when the OS auto-updates the nvidia drivers :KEKLEO:
riverfog7
riverfog7•6mo ago
lol
Unknown User
Unknown User•6mo ago
Message Not Public
Dj
Dj•6mo ago
@yhlong00000 Can you check machine nang0aeab9ed for this ECC error on the host? Edit: fmylxpa4t5so as well?
riverfog7
riverfog7•6mo ago
Unknown User
Unknown User•6mo ago
Message Not Public
riverfog7
riverfog7•6mo ago
google's TPUs are more insane 😆
bghira
bghira•6mo ago
it's really frustrating too. TPUs are easy versus the Cerebras crap
riverfog7
riverfog7•6mo ago
wait.. what? i heard they made a wafer-scale chip. maybe wafer sized, but idk
yhlong00000
yhlong00000•6mo ago
I've unlisted the machine for now and will have infra team check further.
bghira
bghira•6mo ago
well, i went to the cu128 container as recommended and one of the GPUs still gives the ECC error, i think
Dj
Dj•6mo ago
Can you share the pod id? I'm still waiting on a followup from infra
Unknown User
Unknown User•6mo ago
Message Not Public
Dj
Dj•6mo ago
It shouldn't 😧
Unknown User
Unknown User•6mo ago
Message Not Public
Dj
Dj•6mo ago
The GPUs on both machines (this thread and its sibling) have been reset, they should be good to go :)
riverfog7
riverfog7•6mo ago
wow, super fast support. only a 1 business day delay 🙂 (obviously excluding saturday / sunday)
Unknown User
Unknown User•6mo ago
Message Not Public
riverfog7
riverfog7•6mo ago
That's brutal lol. And cloudfront seems to have issues, so can you plz check that
Dj
Dj•6mo ago
I don't think I can do anything about that 😭 What do you mean?
Unknown User
Unknown User•6mo ago
Message Not Public
riverfog7
riverfog7•6mo ago
Yeah 3 ppl had issues
Unknown User
Unknown User•6mo ago
Message Not Public
riverfog7
riverfog7•6mo ago
The HTTP proxy being so slow. They said it got modem-era speeds, 100kbps
Dj
Dj•6mo ago
Oh okay, Cloudfront is a different company than Cloudflare. I was very confused
riverfog7
riverfog7•6mo ago
Wow. Why did i say cloudfront
Dj
Dj•6mo ago
I have a meeting right now but I'll be back soon to look
Unknown User
Unknown User•6mo ago
Message Not Public
riverfog7
riverfog7•6mo ago
Yeah, it's 12:00 here, so probably lack-of-sleep issues
Unknown User
Unknown User•6mo ago
Message Not Public
bghira
bghira•6mo ago
Based on the information you’ve provided, we don’t believe this is an issue with the RunPod platform. Unfortunately, without more specific GPU details, our reliability team is unable to investigate further or escalate the matter to our hardware vendor. We’ve also reviewed our monitoring metrics and didn’t observe any ECC errors in the past 30 days, as I mentioned earlier.
:maaaaan: i can't even believe the audacity of the support team to respond like this. do you guys want us wasting money running broken pods? really? blaming me even though these threads from around the same time were indicating it was a shared issue hitting more than one user :maaaaan:
Dj
Dj•6mo ago
Can you share your pod ID? It is technically a fault of our platform that we distributed you onto this hardware while it's in this state. It's very likely you were placed onto the physical server this thread was created for, and it's something I'm personally, actively working on with our hardware partner: they keep sending this server back online without doing the proper work. We delist it, they say "it's fixed :)" and send it back, and then the same server gets another problem reported. Naturally this is absolutely not the experience we want you to have. I'm not technically weekend support, so I'm not home right now, but I'll find your ticket based on the response you got, look into this, and follow up with you later tonight. @bghira
bghira
bghira•6mo ago
it's this one, but it's gone now. i know how this feels. i have a free 8xH100 system that "fell out of a billing system", and its 100gbps port flaps, so it's like, cool, i guess, but also kinda useless. i told the vendor; they haven't done anything in ~2 weeks
Dj
Dj•6mo ago
Logs are forever (14 days), I'll find it :) Okay, yeah, with the context of knowing you were one of the first reports, this is absolutely a misfire from support. The machine in question is the problematic one that keeps getting relisted by its host.
