Uncorrectable ECC error encountered
Recently I've been getting many "Uncorrectable ECC error encountered" errors on H200 and H100 instances (every one I've tried). I always run a GPU health check first, and the 4x H200 pods I've tried usually fail it. An 8x H100 instance passed the check but then failed during axolotl finetuning with this error. Any ideas why this might be happening all of a sudden?
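
For anyone who wants to look at the ECC counters directly, here is a rough sketch of a per-GPU check. It is not the health check mentioned above, only an illustration. It assumes `nvidia-smi` is on the PATH inside the pod and that the driver exposes the `ecc.errors.uncorrected.*` query fields (listed by `nvidia-smi --help-query-gpu`). The function names `parse_ecc_csv`, `failing_gpus` and `query_ecc` are made up for this example.

```python
import shutil
import subprocess

# Query fields as listed by `nvidia-smi --help-query-gpu`.
FIELDS = [
    "index",
    "name",
    "ecc.errors.uncorrected.volatile.total",   # since the last driver load
    "ecc.errors.uncorrected.aggregate.total",  # lifetime of the GPU
]


def parse_ecc_csv(text):
    """Parse `--format=csv,noheader,nounits` output into a list of dicts."""
    rows = []
    for line in text.strip().splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) != len(FIELDS):
            continue  # skip lines that do not match the expected layout
        index, name, volatile, aggregate = parts
        rows.append({
            "index": int(index),
            "name": name,
            # Counters read "[N/A]" when ECC is disabled or unsupported.
            "volatile": int(volatile) if volatile.isdigit() else None,
            "aggregate": int(aggregate) if aggregate.isdigit() else None,
        })
    return rows


def failing_gpus(rows):
    """Return indices of GPUs with a nonzero volatile uncorrectable count."""
    return [r["index"] for r in rows if r["volatile"]]


def query_ecc():
    """Run nvidia-smi and return parsed counters; raises if the tool is missing."""
    if shutil.which("nvidia-smi") is None:
        raise RuntimeError("nvidia-smi not found on PATH")
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=" + ",".join(FIELDS),
         "--format=csv,noheader,nounits"],
        check=True, capture_output=True, text=True,
    )
    return parse_ecc_csv(result.stdout)
```

Calling `failing_gpus(query_ecc())` inside the pod would list the GPU indices that currently report uncorrectable errors. A nonzero aggregate count with a zero volatile count would suggest the errors happened before the last driver reload.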


2 Replies
Hey, would you mind opening a support ticket and including the pod ID? We can take a look.
@nielsrolf
Escalated To Zendesk
The thread has been escalated to Zendesk!