I’m trying to run a cluster of 4 x A40s for model fine-tuning. Whenever I launch an instance, exactly one of the GPUs is always at 100% utilisation, and nvidia-smi shows an error for it. I’ve tried creating new clusters multiple times, but each time one GPU is broken. Can I somehow avoid this bricked GPU, or replace it? Thanks