Bricked A40 GPU at 100% utilisation and nvidia-smi error at launch
I’m trying to run a cluster of 4 x A40s for model finetuning. When I launch an instance, exactly one of the GPUs is pinned at 100% utilisation, and nvidia-smi reports an error for it. I’ve recreated the cluster multiple times, but each time one GPU is broken. Can I somehow avoid this bricked GPU, or get it replaced? Thanks
1 Reply
If you can provide the pod ID (or, bonus points, the GPU ID), we can have the GPU repaired and the problematic node delisted.
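Not an official workflow, just a minimal sketch of one way to pull the per-GPU identifiers from inside the pod so you can spot the card stuck at 100% and paste its UUID into the report. It assumes nvidia-smi is on the PATH and that your driver accepts these query fields (Python is only used here as a thin wrapper around the CLI):

```python
import subprocess

# Standard nvidia-smi query fields; "serial" may be unavailable on some
# virtualised GPUs, in which case drop it from the list.
QUERY_FIELDS = "index,uuid,serial,utilization.gpu"

result = subprocess.run(
    ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}", "--format=csv,noheader"],
    capture_output=True,
    text=True,
    check=True,
)

# One line per GPU, e.g. "2, GPU-9f3c..., 132..., 100 %"
# The GPU showing an error and 100% utilisation is the one to report.
for line in result.stdout.strip().splitlines():
    print(line)
```

Running `nvidia-smi --query-gpu=index,uuid,utilization.gpu --format=csv,noheader` directly in a shell gives the same output if you'd rather not use Python.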