Strange behaviour with multiple pods running at the same time
Hi,
I have a few pods running in parallel: a couple on 4090s and one on a 5090.
I am training a LoRA on the 5090 while doing general work on the 4090s.
I have observed a couple of strange things.
The first is that at some point my 5090 node was showing very low resource utilization (0% GPU, etc.), yet I could see the training still going strong and generating samples. Meanwhile, one of the 4090 nodes that was supposed to be idle was running at 100%, even though nothing was being done on it. nvidia-smi showed that ai-toolkit was taking the resources there (despite not being used on that node).
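For what it's worth, this is roughly how I cross-check which process is actually holding the GPU on a node (just a minimal sketch using the pynvml bindings, not anything from ai-toolkit itself; device index 0 is an assumption since each pod exposes a single GPU):

```python
# Sketch: list GPU utilization and the processes holding VRAM on this node.
# Assumes the pynvml bindings are installed (pynvml / nvidia-ml-py package).
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # assumed: single GPU per pod

util = pynvml.nvmlDeviceGetUtilizationRates(handle)
print(f"GPU utilization: {util.gpu}%")

for proc in pynvml.nvmlDeviceGetComputeRunningProcesses(handle):
    # usedGpuMemory can be None inside some containers, so guard against it
    used_mib = (proc.usedGpuMemory or 0) / 1024**2
    name = pynvml.nvmlSystemGetProcessName(proc.pid)
    print(f"pid={proc.pid} name={name} used={used_mib:.0f} MiB")

pynvml.nvmlShutdown()
```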
The second is that after rebooting my 5090 node and trying to resume a training job, I am getting a CUDA out of memory error: "GPU 0 has a total capacity of 23.53 GiB of which 52.38 MiB is free." Yet I am clearly on the 5090, which reports 32 GB of VRAM.
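This is the kind of quick check I run inside the training container to see what the process itself reports (again just a sketch, assuming plain PyTorch; device index 0 assumed):

```python
# Sketch: print what the current process actually sees for GPU 0.
import torch

props = torch.cuda.get_device_properties(0)
print(torch.cuda.get_device_name(0))
print(f"total VRAM: {props.total_memory / 1024**3:.2f} GiB")
print(f"allocated:  {torch.cuda.memory_allocated(0) / 1024**3:.2f} GiB")
print(f"reserved:   {torch.cuda.memory_reserved(0) / 1024**3:.2f} GiB")
```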
Is this a known behaviour?