Runpod 2mo ago
Gosen

Bricked A40 GPU at 100% utilisation and nvidia-smi error at launch

I’m trying to run a cluster of 4 x A40s for model finetuning. Whenever I launch an instance, exactly one of the GPUs sits at 100% utilisation and nvidia-smi reports an error for it. I’ve tried making new clusters multiple times, but each time one GPU is broken. Can I somehow avoid this bricked GPU, or replace it? Thanks
1 Reply
Dj 2mo ago
If you can provide a pod ID (or, for bonus points, the GPU ID), we can have the GPU repaired and the problematic node delisted.
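
For anyone gathering those details, here is a rough sketch of one way to pull them from inside the pod. It assumes `nvidia-smi` is on the PATH and that the pod exposes its ID in a `RUNPOD_POD_ID` environment variable (an assumption, check your own pod). The helper names `parse_gpu_rows` and `query_gpus` are made up for illustration. A faulty GPU may report placeholder text instead of numbers, so every field is kept as a string.

```python
import csv
import io
import os
import subprocess

# Fields requested from nvidia-smi, in the order they come back.
QUERY_FIELDS = ["index", "uuid", "name", "utilization.gpu"]


def parse_gpu_rows(csv_text):
    """Parse `nvidia-smi --format=csv,noheader,nounits` output into dicts."""
    rows = []
    reader = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    for fields in reader:
        if len(fields) != len(QUERY_FIELDS):
            continue  # skip blank or malformed lines
        rows.append(dict(zip(QUERY_FIELDS, (f.strip() for f in fields))))
    return rows


def query_gpus():
    """Run nvidia-smi and return one dict per GPU it reports."""
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=" + ",".join(QUERY_FIELDS),
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=False,  # a broken GPU can make nvidia-smi exit non-zero
    )
    return parse_gpu_rows(result.stdout)


if __name__ == "__main__":
    # RUNPOD_POD_ID is assumed here; fall back to a marker if it is absent.
    print("pod id:", os.environ.get("RUNPOD_POD_ID", "<not set>"))
    for gpu in query_gpus():
        print(gpu["index"], gpu["uuid"], gpu["name"], gpu["utilization.gpu"])
```

The `uuid` value (it starts with `GPU-`) on the row that shows the error or the stuck 100% utilisation is the one to paste into the thread along with the pod ID.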
