Runpod, 2y ago
2 replies
MarioHachemer

CUDA error: uncorrectable ECC error encountered

I just provisioned an 8xH100 NVL machine and had it load a very large model. The container then got stuck in a restart loop, failing with this error every time it tried to load the model:

2024-08-04T16:43:13.809833249Z RuntimeError: CUDA error: uncorrectable ECC error encountered

This looks like a hardware defect. Is there a way to get my credits back for that run?
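
For anyone who wants to confirm which GPU is affected before filing a report, here is a minimal sketch. It assumes `nvidia-smi` is available inside the container and that the driver exposes the `ecc.errors.uncorrected.*` query fields (listed by `nvidia-smi --help-query-gpu`). The helper names `parse_ecc_report` and `query_ecc_errors` are made up for illustration and are not part of any Runpod or NVIDIA tooling.

```python
import csv
import io
import subprocess

# Field names as listed by `nvidia-smi --help-query-gpu` (assumed to be supported by the installed driver).
QUERY_FIELDS = (
    "index,name,"
    "ecc.errors.uncorrected.volatile.total,"
    "ecc.errors.uncorrected.aggregate.total"
)


def parse_ecc_report(csv_text):
    """Return (index, name, volatile, aggregate) for GPUs with uncorrected ECC errors."""
    flagged = []
    for row in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if len(row) != 4:
            continue  # skip blank or malformed lines
        index, name, volatile, aggregate = (cell.strip() for cell in row)
        # Counters that are not plain integers (for example "[N/A]") are treated as 0.
        counts = [int(v) if v.isdigit() else 0 for v in (volatile, aggregate)]
        if any(counts):
            flagged.append((int(index), name, counts[0], counts[1]))
    return flagged


def query_ecc_errors():
    """Run nvidia-smi and return the GPUs that report uncorrected ECC errors."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_ecc_report(result.stdout)
```

Calling `query_ecc_errors()` on the pod returns one tuple per GPU with a nonzero uncorrected count, and that list could be attached to a support request alongside the log line above. An empty list does not rule out a fault, since the volatile counter resets when the driver reloads.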