Uncorrectable ECC error encountered

Recently I've been getting many "Uncorrectable ECC error encountered" errors on H200 and H100 instances (every one I've tried). I always run a GPU health check first, and the 4x H200 pods I've tried usually fail it. An 8x H100 instance did pass the health check, but then failed during axolotl fine-tuning with this error. Any ideas why this might be happening all of a sudden?
Screenshot_2025-04-06_at_13.04.18.png
Screenshot_2025-04-06_at_13.06.51.png
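
For reference, here is a minimal sketch of how the per-GPU uncorrectable ECC counters can be read, assuming `nvidia-smi` is on the PATH. It is not the health check mentioned above, and the helper names (`parse_ecc_csv`, `failing_gpus`, `query_gpus`) are hypothetical. The query fields can be checked with `nvidia-smi --help-query-gpu`.

```python
import csv
import io
import subprocess

# Fields passed to `nvidia-smi --query-gpu`: GPU index, name, and the
# uncorrectable ECC counters since last reboot (volatile) and lifetime (aggregate).
FIELDS = (
    "index,name,"
    "ecc.errors.uncorrected.volatile.total,"
    "ecc.errors.uncorrected.aggregate.total"
)


def to_int(value):
    """Return the counter as an int, or None if it is not numeric (e.g. N/A)."""
    try:
        return int(value)
    except ValueError:
        return None


def parse_ecc_csv(text):
    """Parse headerless, unit-free CSV output into one dict per GPU."""
    rows = []
    for rec in csv.reader(io.StringIO(text), skipinitialspace=True):
        if len(rec) != 4:
            continue  # skip blank or malformed lines
        index, name, volatile, aggregate = (field.strip() for field in rec)
        rows.append({
            "index": to_int(index),
            "name": name,
            "volatile": to_int(volatile),
            "aggregate": to_int(aggregate),
        })
    return rows


def failing_gpus(rows):
    """Return the GPUs that report at least one uncorrectable ECC error."""
    return [r for r in rows if (r["volatile"] or 0) > 0 or (r["aggregate"] or 0) > 0]


def query_gpus():
    """Run nvidia-smi and return its raw CSV output (needs an NVIDIA driver)."""
    cmd = ["nvidia-smi", f"--query-gpu={FIELDS}", "--format=csv,noheader,nounits"]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
```

On a pod, `failing_gpus(parse_ecc_csv(query_gpus()))` would list the GPUs with non-zero counters. A non-empty result before any job has started would suggest the hardware itself rather than the training code.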