The GPU usage was at 100% in the GUI, too, so I thought it was doing work.
- ERR! in Fan column indicates HARDWARE FAILURE
- 100% GPU Utilization with ZERO running processes
- GPU consuming 140W power while doing NOTHING
- Only 1MiB/46068MiB memory used but GPU stuck at 100%
[rank0]: RuntimeError: CUDA driver initialization failed, you might not have a CUDA gpu.
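In case it helps anyone else spot the same state, here is a rough sketch of how the symptoms listed above could be checked from a script. It assumes `nvidia-smi` is on the PATH and uses its `--query-gpu` and `--query-compute-apps` options. The function names, the thresholds, and the sample fan string are my own assumptions, not anything from the pod or from NVIDIA's docs. Passively cooled cards can report the fan as `[N/A]`, so that value is not flagged.

```python
import subprocess

# Fields requested from nvidia-smi, in the order the parser below expects.
QUERY = "fan.speed,utilization.gpu,power.draw,memory.used,memory.total"


def read_gpu_state(gpu_index=0):
    """Return (csv_line, n_processes) for one GPU. Needs nvidia-smi installed."""
    line = subprocess.run(
        ["nvidia-smi", f"--id={gpu_index}", f"--query-gpu={QUERY}",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout.strip().splitlines()[0]
    apps = subprocess.run(
        ["nvidia-smi", f"--id={gpu_index}", "--query-compute-apps=pid",
         "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout.split()
    return line, len(apps)


def suspicious_signs(csv_line, n_processes,
                     util_threshold=90.0, mem_threshold_mib=16.0):
    """Flag the 'busy but idle' pattern; thresholds are arbitrary assumptions."""
    fan, util, power, mem_used, mem_total = [f.strip() for f in csv_line.split(",")]
    signs = []

    # A non-numeric fan value other than [N/A] is treated as a sensor problem.
    if not fan.replace(".", "", 1).isdigit() and fan != "[N/A]":
        signs.append(f"fan reading is not a number: {fan}")

    try:
        busy = float(util) >= util_threshold
    except ValueError:
        busy = False  # utilization itself could not be read

    if busy and n_processes == 0:
        signs.append(f"{util}% utilization at {power} W with no compute processes")

    try:
        nearly_empty = float(mem_used) <= mem_threshold_mib
    except ValueError:
        nearly_empty = False

    if busy and nearly_empty:
        signs.append(f"{util}% utilization with only {mem_used}/{mem_total} MiB in use")

    return signs
```

On a machine with the driver loaded, `suspicious_signs(*read_gpu_state())` returns an empty list for a healthy GPU and one message per symptom otherwise. If the driver cannot be initialized at all, `nvidia-smi` itself may fail and `read_gpu_state` will raise instead.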
I have only stopped the pod, not terminated it, in case you need to investigate. Ask here if you need more details, such as the pod ID.