Lost GPUs mid-run
I was running on a 5090 pod with 3 GPUs (that's all that was available for it). Mid-run, my software complained that there were no CUDA GPUs. After stopping the app I ran nvidia-smi and got "Failed to initialize NVML: Unknown Error". That has never happened to me before.
Solution
Restarting helped, so this is just an FYI. I also opened a ticket about this and included the pod ID.
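
If you want to catch this condition from a script instead of finding out when the app crashes, here is a minimal sketch of a health check. It assumes nvidia-smi is on the PATH and that the failure shows up the same way it did for me (a non-zero exit code or the "Failed to initialize NVML" message). The function names gpus_lost and check_gpus are made up for this example and are not part of any library.

```python
import shutil
import subprocess


def gpus_lost(returncode: int, output: str) -> bool:
    """Return True if an nvidia-smi result looks like the GPUs are gone.

    Assumption: a non-zero exit code or the NVML init message means failure.
    """
    return returncode != 0 or "Failed to initialize NVML" in output


def check_gpus(timeout_s: float = 10.0) -> bool:
    """Return True if `nvidia-smi -L` (list GPUs) runs cleanly, else False."""
    if shutil.which("nvidia-smi") is None:
        # No driver tools installed at all; treat as "no GPUs visible".
        return False
    try:
        proc = subprocess.run(
            ["nvidia-smi", "-L"],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except (subprocess.TimeoutExpired, OSError):
        # A hung or unlaunchable nvidia-smi is also treated as a failure.
        return False
    return not gpus_lost(proc.returncode, proc.stdout + proc.stderr)


if __name__ == "__main__":
    if check_gpus():
        print("GPUs visible")
    else:
        print("GPUs not visible; a restart fixed this for me")
```

Running something like this periodically (or right before saving a checkpoint) would at least tell you when the GPUs dropped. It does not explain why they dropped.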