Uncorrectable ECC error encountered

Recently I've been getting many "Uncorrectable ECC error encountered" errors on every H200 and H100 instance I've tried. I always run a GPU health check first, and the 4x H200 pods I've tried usually fail it. An 8x H100 instance passed the health check but then failed with this error during axolotl finetuning. Any ideas why this might be happening all of a sudden?
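For readers who want to look at the ECC counters themselves before starting a run, here is a minimal sketch. It is not the health check mentioned above; it assumes `nvidia-smi` is on the PATH inside the pod and uses the query fields listed by `nvidia-smi --help-query-gpu`. The function names are hypothetical and only for illustration.

```python
import subprocess

# Query fields as listed by `nvidia-smi --help-query-gpu`.
QUERY_FIELDS = [
    "index",
    "name",
    "ecc.errors.uncorrected.volatile.total",
    "ecc.errors.uncorrected.aggregate.total",
]


def to_count(value):
    """Convert a counter to int; nvidia-smi prints "[N/A]" when ECC is off or unsupported."""
    try:
        return int(value)
    except ValueError:
        return None


def parse_ecc_csv(text):
    """Parse `--format=csv,noheader,nounits` output into one dict per GPU."""
    rows = []
    for line in text.strip().splitlines():
        if not line.strip():
            continue
        index, name, volatile, aggregate = [p.strip() for p in line.split(",")]
        rows.append({
            "index": int(index),
            "name": name,
            "volatile": to_count(volatile),    # since the last driver reload
            "aggregate": to_count(aggregate),  # lifetime of the GPU
        })
    return rows


def gpus_with_uncorrectable_errors(rows):
    """Return the indices of GPUs reporting any uncorrectable ECC errors."""
    return [
        r["index"]
        for r in rows
        if (r["volatile"] or 0) > 0 or (r["aggregate"] or 0) > 0
    ]


def query_ecc():
    """Run nvidia-smi and return the parsed per-GPU ECC counters."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(QUERY_FIELDS),
        "--format=csv,noheader,nounits",
    ]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return parse_ecc_csv(out)


if __name__ == "__main__":
    rows = query_ecc()
    for row in rows:
        print(row)
    bad = gpus_with_uncorrectable_errors(rows)
    if bad:
        print("GPUs with uncorrectable ECC errors:", bad)
```

A nonzero aggregate count with a zero volatile count suggests the errors happened before the current driver session, which may be worth mentioning when reporting the pod.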
2 Replies
yhlong00000 · 4w ago
Hey, would you mind opening a support ticket that includes the pod ID? We can take a look.
Poddy · 4w ago
@nielsrolf
Escalated To Zendesk
The thread has been escalated to Zendesk!
