RunPod•4mo ago
jg

ECC errors on serverless workers using L4

We are currently using L4 machines in the eu-ro region for our production environment (30~70 workers). Based on the requests data, we have seen increasing hardware issues related to ECC errors and were wondering if we could get help in mitigating these failures.
"handler: CUDA error: uncorrectable ECC error encountered\nCUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.\nFor debugging consider passing CUDA_LAUNCH_BLOCKING=1.\nCompile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
"handler: CUDA error: uncorrectable ECC error encountered\nCUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.\nFor debugging consider passing CUDA_LAUNCH_BLOCKING=1.\nCompile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Based on the "Requests" data from our endpoint, we see that the failures have increased starting from 2024.02.02. We do have a couple of questions, but ultimately, it would be great if we were provided some guidance to fully handling the failing requests. - Are we expected to terminate the instance with this issue? - Is there a way to handle this from the code (and not having to do it manually) - Difference between "terminate" and "refresh" - It seems that after terminating a worker that had an uncorrected ECC issue, a new pod is respawned on the same machine. Is there a way to avoid this - For example, the machine with the ID x4udv5lkhl7d was still getting assigned pods even after terminating workers - Any recommendations on monitoring for these occurrences in the workers we use (especially for those used in production)
6 Replies
flash-singh
flash-singh•4mo ago
we have an uncorrectable ecc check already built-in, will look into that and see why that one server isn't being flagged
jg
jg•4mo ago
thanks! we keep seeing this particular machine (x4udv5lkhl7d) with ECC errors
ashleyk
ashleyk•4mo ago
Did you try terminating the worker? I usually terminate the worker when this kind of thing happens.
jg
jg•4mo ago
We've tried terminating, but at some later point in time, some of our workers get spawned on the same machine that has been throwing ECC errors. Even after refreshing, the machine might recover, but it fails again after some time.
@flash-singh I know you guys might be on holiday, but do you have any updates for us?
flash-singh
flash-singh•4mo ago
is this a worker id? i was able to find the gpu causing this. we were checking for ecc.errors.uncorrected.volatile.total; while that's 0, ecc.errors.uncorrected.aggregate.total shows a high number of faults: https://gist.github.com/sansmoraxz/8a98d987f12d7edc983d611b8326fc67
will have to roll an update to start flagging gpus with those errors
this is solved now, took the server out of the pool
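For reference, the two counters mentioned above can be read directly from nvidia-smi, so workers can check them on their own as well. A minimal monitoring sketch; the "any non-zero aggregate count" threshold is just one reasonable alerting choice, not RunPod's built-in check:
```python
# Read the same two ECC counters with nvidia-smi and flag any GPU whose
# aggregate (lifetime) uncorrected count is non-zero, even if the volatile
# (since-reboot) count is still 0.
import subprocess

QUERY = "index,ecc.errors.uncorrected.volatile.total,ecc.errors.uncorrected.aggregate.total"

out = subprocess.run(
    ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
    capture_output=True, text=True, check=True,
).stdout

for line in out.strip().splitlines():
    index, volatile, aggregate = [field.strip() for field in line.split(",")]
    # GPUs without ECC report "[N/A]" for these fields, hence the isdigit() guard.
    if aggregate.isdigit() and int(aggregate) > 0:
        print(f"GPU {index}: volatile={volatile}, aggregate={aggregate} -> flag this worker")
```
Running something like this at worker startup and refusing jobs (or exiting) when the aggregate counter is already non-zero is one way to avoid picking up work on a known-bad GPU.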
jg
jg•4mo ago
Awesome! Thank you very much for the help. We're seeing no failures so far from our endpoint in production 👍