Bricked A40 GPU at 100% utilisation and nvidia-smi error at launch
I’m trying to run a cluster of 4 x A40s for model finetuning. When I launch an instance, exactly one of the GPUs is pinned at 100% utilisation, and nvidia-smi reports an error for it. I’ve recreated the cluster multiple times, but each time one GPU is broken. Can I somehow avoid this bricked GPU, or get it replaced? Thanks
1 Reply
If you can provide the pod ID (or, bonus points, the GPU ID), we can have the GPU repaired and the problematic node delisted.
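Not an official workflow, just a minimal sketch of one way to pull the per-GPU identifiers from inside the pod so you can spot the card stuck at 100% and paste its UUID into the report. It assumes nvidia-smi is on the PATH and that your driver accepts these query fields (Python is only used here as a thin wrapper around the CLI):

```python
import subprocess

# Standard nvidia-smi query fields; "serial" may be unavailable on some
# virtualised GPUs, in which case drop it from the list.
QUERY_FIELDS = "index,uuid,serial,utilization.gpu"

result = subprocess.run(
    ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}", "--format=csv,noheader"],
    capture_output=True,
    text=True,
    check=True,
)

# One line per GPU, e.g. "2, GPU-9f3c..., 132..., 100 %"
# The GPU showing an error and 100% utilisation is the one to report.
for line in result.stdout.strip().splitlines():
    print(line)
```

Running `nvidia-smi --query-gpu=index,uuid,utilization.gpu --format=csv,noheader` directly in a shell gives the same output if you'd rather not use Python.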