Strange behaviour with multiple pods running at the same time
Hi,
I have a few pods running in parallel: a couple on 4090s and one on a 5090.
I am training a LoRA on the 5090 while doing general work on the 4090s.
I have observed a couple of strange things.
The first is that at some point my 5090 node was showing very low resource utilization (0% GPU, etc.), yet I could see the training still going strong and generating samples. Meanwhile, one of the 4090 nodes that was supposed to be idle was running at 100%, even though nothing was being done on it. nvidia-smi showed that ai-toolkit was taking the resources there (despite not being used on that node).
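For what it's worth, this is roughly how I cross-check which process is actually holding the GPU on a node (just a minimal sketch using the pynvml bindings, not anything from ai-toolkit itself; device index 0 is an assumption since each pod exposes a single GPU):

```python
# Sketch: list GPU utilization and the processes holding VRAM on this node.
# Assumes the pynvml bindings are installed (pynvml / nvidia-ml-py package).
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # assumed: single GPU per pod

util = pynvml.nvmlDeviceGetUtilizationRates(handle)
print(f"GPU utilization: {util.gpu}%")

for proc in pynvml.nvmlDeviceGetComputeRunningProcesses(handle):
    # usedGpuMemory can be None inside some containers, so guard against it
    used_mib = (proc.usedGpuMemory or 0) / 1024**2
    name = pynvml.nvmlSystemGetProcessName(proc.pid)
    print(f"pid={proc.pid} name={name} used={used_mib:.0f} MiB")

pynvml.nvmlShutdown()
```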
The second is that after rebooting my 5090 node and trying to resume a training job, I am getting a CUDA out of memory error: "GPU 0 has a total capacity of 23.53 GiB of which 52.38 MiB is free." Yet I am clearly on the 5090, which reports 32 GB of VRAM.
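This is the kind of quick check I run inside the training container to see what the process itself reports (again just a sketch, assuming plain PyTorch; device index 0 assumed):

```python
# Sketch: print what the current process actually sees for GPU 0.
import torch

props = torch.cuda.get_device_properties(0)
print(torch.cuda.get_device_name(0))
print(f"total VRAM: {props.total_memory / 1024**3:.2f} GiB")
print(f"allocated:  {torch.cuda.memory_allocated(0) / 1024**3:.2f} GiB")
print(f"reserved:   {torch.cuda.memory_reserved(0) / 1024**3:.2f} GiB")
```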
Is this a known behaviour?