CUDA device uncorrectable ECC error
I'm using a 5xH100 pod and I'm getting an uncorrectable ECC error on devices 1, 2, and 3. Devices 0 and 4 can be used without a problem. It seems the devices or the system need a reboot. Any help on this? I've already submitted a ticket on the website with the pod ID.
```
Python 3.12.5 | packaged by Anaconda, Inc. | (main, Sep 12 2024, 18:27:27) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.device_count()
5
>>> torch.tensor([1], device='cuda:0')
tensor([1], device='cuda:0')
>>> torch.tensor([1], device='cuda:1')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: CUDA error: uncorrectable ECC error encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
>>> torch.tensor([1], device='cuda:2')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: CUDA error: uncorrectable ECC error encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
>>> torch.tensor([1], device='cuda:3')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: CUDA error: uncorrectable ECC error encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
>>> torch.tensor([1], device='cuda:4')
tensor([1], device='cuda:4')
```
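For anyone landing here with the same symptom, a minimal sketch (plain PyTorch, nothing RunPod-specific) that probes every visible device with a tiny allocation, so you can list exactly which indices are in the bad ECC state before opening a ticket:

```python
import torch

# Probe each visible GPU with a tiny allocation. A device whose memory is in a
# bad ECC state raises a RuntimeError the first time a context is created on it.
bad = []
for i in range(torch.cuda.device_count()):
    name = torch.cuda.get_device_name(i)
    try:
        torch.tensor([1], device=f"cuda:{i}")
        print(f"cuda:{i} ({name}): OK")
    except RuntimeError as err:
        bad.append(i)
        print(f"cuda:{i} ({name}): {err}")

print("devices reporting errors:", bad)
```

Running it with `CUDA_LAUNCH_BLOCKING=1` set, as the error message itself suggests, keeps error reporting synchronous so the failing device index is unambiguous.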
56 Replies
Hey, can you share your pod id or dm me your account email?
Your container cuda version, and what cuda version is your pod?
Run nvidia-smi and nvcc --version to check, I think
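In case it helps, a quick sketch for grabbing both in one go from inside the pod (assumes PyTorch is installed in the container; the nvidia-smi call just shells out to the driver tooling):

```python
import subprocess

import torch

# CUDA runtime version the container's PyTorch build expects...
print("torch:", torch.__version__)
print("torch built with CUDA:", torch.version.cuda)

# ...versus the host driver version reported through nvidia-smi.
print(subprocess.run(
    ["nvidia-smi", "--query-gpu=index,name,driver_version", "--format=csv,noheader"],
    capture_output=True, text=True,
).stdout.strip())
```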
actually the same error is occurring here on H100 pods (Secure Cloud)
@Dj pod id is hs6vtqnu343wwj
@bghira just wondering
Cu124
it's H100s. it's not going to be 11.8. 🙂
All 12.4?
Try 12.8
it's not the problem..
also, 12.8 will need an entirely different version of pytorch. that isn't something someone can do in production
Oh I see, what's the problem?
see the original post in this thread
support is looking at the ticket with the pod ID. i'll update here if they give any more insight into the reason
at my org we have something like 200-300 H100s and i've seen CUDA permissions error or NVML init error, but never this kinda ongoing ECC error, it's new to me, so i'm curious too
I'm also taking a look, we had a small outage yesterday which may be related, but I'm working on going through the relevant logs (similar to what support would be doing)
been there. i understand
thanks for taking a look
can confirm im dying waiting for cuda kernels to compile
Are you on the same org or just also seeing the same error?
different case
my experience with H100s was fine (cuz i don't use them a lot)
this is a pretty rare issue for SXM5 systems but i'm on H100 PCIe which are a lot more "meh" from past experience (not a RunPod problem, it's an NVIDIA problem)
that was about making vllm to work with blackwell
Just got freed from my meeting, hunting down the error now
i think someone got nvlink errors on H100s
in other thread
that happens if the host system has the fabric manager crash
yeah
mentioned this issue there
must be the outage mentioned from yesterday
you know the other annoying thing is if the OS auto-updates the nvidia drivers :KEKLEO:
lol
What's a fabric manager
@yhlong00000 Can you check machine
nang0aeab9ed
for this ECC error on the host?
Edit: fmylxpa4t5so as well?
the thing that manages comm between the GPUs, I guess?
https://docs.nvidia.com/datacenter/tesla/fabric-manager-user-guide/index.html
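Rough sketch of what checking it looks like, assuming the standard nvidia-fabricmanager systemd unit from that guide (host-side only, not something you can run from inside a rented pod):

```python
import subprocess

# On NVSwitch-based hosts the fabric manager runs as a systemd service; if it
# crashes, NVLink / peer-to-peer traffic between the GPUs stops working.
result = subprocess.run(
    ["systemctl", "is-active", "nvidia-fabricmanager"],
    capture_output=True, text=True,
)
print("nvidia-fabricmanager:", result.stdout.strip() or result.stderr.strip())
```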
Wow, that's complicated stuff
google's TPUs are more insane 😆
it's really frustrating too
TPUs are easy versus Cerebras crap
wait.. what?
i heard they made a wafer level chip
maybe wafer sized but idk
I've unlisted the machine for now and will have infra team check further.
well i went to the cu128 container as recommended and one of the GPUs still gives an ECC error, i think
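If you want to confirm which index is actually flagging ECC errors, rather than guessing from the PyTorch traceback, here's a rough sketch using pynvml (the nvidia-ml-py package), assuming NVML is reachable from inside the container:

```python
import pynvml  # ships as the nvidia-ml-py package

pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    name = pynvml.nvmlDeviceGetName(handle)
    # Volatile = since the last reset/reboot; aggregate = lifetime of the board.
    for label, counter in (("volatile", pynvml.NVML_VOLATILE_ECC),
                           ("aggregate", pynvml.NVML_AGGREGATE_ECC)):
        try:
            errs = pynvml.nvmlDeviceGetTotalEccErrors(
                handle, pynvml.NVML_MEMORY_ERROR_TYPE_UNCORRECTED, counter)
            print(f"GPU {i} ({name}): uncorrected ECC ({label}) = {errs}")
        except pynvml.NVMLError as err:
            print(f"GPU {i} ({name}): ECC query failed: {err}")
pynvml.nvmlShutdown()
```

Volatile counters clear when the GPU is reset or the host reboots; aggregate counters persist in the board's InfoROM.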
Can you share the pod id? I'm still waiting on a followup from infra
seems like it happens to a lot of machines
It shouldn't 😧
just curious, how many GPUs were you using, and which type?
The GPUs on both machines (this thread and its sibling) have been reset, they should be good to go :)
wow super fast support
1 business day delay time 🙂 (obviously excluding saturday / sunday)
Even faster if the DC engineer can stay up 24h hahah
Thats brutal lol
And cloudfront seems to have issues, so can you plz check that?
I don't think I can do anything about that 😠 What do you mean?
Maybe the proxy, cloudflare
Yeah
3 ppl had issues
Perhaps you're referring to this? https://discord.com/channels/912829806415085598/1360254321614127145
Http proxy being so slow
They said it got modem-era speeds, around 100 kbps
Oh okay, CloudFront is from a different company than Cloudflare, I was very confused
Wow
Why did i say cloudfront
I have a meeting right now but I'll be back soon to look
Hahah you use AWS a lot?
Yeah
It's 12:00 here, so probably lack-of-sleep issues
Yup good night
> Based on the information you've provided, we don't believe this is an issue with the RunPod platform. Unfortunately, without more specific GPU details, our reliability team is unable to investigate further or escalate the matter to our hardware vendor. We've also reviewed our monitoring metrics and didn't observe any ECC errors in the past 30 days, as I mentioned earlier.

:maaaaan: i can't even believe the audacity of the support team to respond like this. do you guys want us wasting money running broken pods? really? blaming me, even when these threads from around the same time were showing it was a shared issue hitting more than one user :maaaaan:
Can you share your pod ID? It is technically a fault of our platform that we distributed you onto this hardware while it is in this state.
It's very likely you were placed onto the physical server this thread was created for and it's something I am personally actively working on with our hardware partner, they're sending this server back online without doing the proper work. We delist it, they say "it's fixed :)" and send it back, and the same server has a problem reported. Naturally this is absolutely not the experience we want you to have.
I'm not normally weekend support, so I'm not home right now, but I'll find your ticket based on the response you got, look into this, and follow up with you later tonight.
@bghira
it's this one but it's gone now
i know how this feels. i have a free 8xH100 system that "fell out of a billing system" and its 100gbps port flaps so it's like, cool, i guess, but also kinda useless. i told the vendor. they haven't done anything in ~2 weeks
Logs are forever (14 days), I'll find it :)
Okay, yeah with the context of knowing you were one of the first reports this is absolutely a misfire from support. The machine in question is the problematic machine that keeps getting relisted by its host.