Not all workers being utilized
In the attached image you can see 11/12 workers spun up, but only 7 are being utilized, yet we're being charged for all 12 GPUs. @girishkd

18 Replies
Unknown User•2y ago
Message Not Public
If you look at the "Jobs" section, it shows 7 in progress. So it is not utilizing all the GPUs to serve requests; only 7 are serving them.
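For reference, the same numbers the dashboard shows can be pulled from the endpoint's health route. A minimal sketch, assuming the standard /v2/<ENDPOINT_ID>/health route and an API key in RUNPOD_API_KEY (both placeholders here):
```python
import os
import requests

ENDPOINT_ID = "your-endpoint-id"  # placeholder, not a real endpoint

resp = requests.get(
    f"https://api.runpod.ai/v2/{ENDPOINT_ID}/health",
    headers={"Authorization": f"Bearer {os.environ['RUNPOD_API_KEY']}"},
    timeout=10,
)
resp.raise_for_status()
# Expected to include jobs in progress/queued and worker states.
print(resp.json())
```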
Unknown User•2y ago
Message Not Public
It's just an SDXL model.
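For context, a serverless SDXL worker usually boils down to a handler along these lines. This is only a sketch assuming the runpod Python SDK and diffusers; the model ID, prompt key, and output path are illustrative, not the poster's actual code:
```python
import runpod
import torch
from diffusers import StableDiffusionXLPipeline

# Load the pipeline once per worker so it stays warm between jobs.
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
).to("cuda")

def handler(job):
    # "prompt" as the input key is an assumption for this sketch.
    prompt = job["input"].get("prompt", "a photo of an astronaut")
    image = pipe(prompt).images[0]
    path = "/tmp/output.png"
    image.save(path)
    return {"image_path": path}

runpod.serverless.start({"handler": handler})
```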
On some of the GPUs a CUDA failure was seen, and when we remove those GPUs from the list of workers, replacement workers are not spinning up.
Unknown User•2y ago
Message Not Public
We limited the CUDA versions to 12.1.
@girishkd and I are colleagues.
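To confirm the 12.1 filter is actually taking effect on the hosts the workers land on, a quick startup check like this sketch (assuming PyTorch and nvidia-smi are available in the worker image) prints both the runtime and driver-side versions:
```python
import subprocess
import torch

print("torch built against CUDA:", torch.version.cuda)  # e.g. "12.1"
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))

# nvidia-smi reports the driver-side CUDA version of the host this worker landed on.
print(subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout)
```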
Unknown User•2y ago
Message Not Public
nope nope
Unknown User•2y ago
Message Not Public
What kind of CUDA failure? Did it OOM from running out of VRAM?
I've seen that happen on 24GB GPUs when you add upscaling.
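For the OOM case specifically, diffusers exposes a few memory savers that help SDXL plus upscaling fit on 24 GB cards. A sketch under the assumption that the worker uses a diffusers pipeline (model ID is illustrative):
```python
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
)

pipe.enable_model_cpu_offload()   # keep submodules on CPU, move each to GPU only while it runs (needs accelerate)
pipe.enable_vae_tiling()          # decode large / upscaled latents in tiles instead of all at once
pipe.enable_attention_slicing()   # lower peak VRAM at a small speed cost
```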
The attached screenshot shows the CUDA failure we are experiencing.

We are using only the 24 GB ones (4090s).
Oh yeah that error seems to be due to a broken worker.
Okay. These broken workers are not getting respawned on their own. What should we do in that case?
Contact RunPod support via web chat or email
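Until support resolves it, one defensive pattern is to let the worker process exit on an unrecoverable CUDA error rather than keep serving failures. This is only a sketch: it assumes the platform replaces a worker whose container exits, and run_inference is a placeholder for the real pipeline call:
```python
import os
import runpod
import torch

def run_inference(job_input):
    # Placeholder for the real SDXL call in this sketch.
    return {"cuda_ok": torch.cuda.is_available(), "echo": job_input}

def handler(job):
    try:
        return run_inference(job["input"])
    except RuntimeError as err:
        # CUDA errors (ECC faults, invalid device, etc.) usually poison the whole process.
        if "CUDA" in str(err):
            print(f"Unrecoverable CUDA error, exiting worker: {err}", flush=True)
            os._exit(1)  # assumption: a dead worker gets replaced by the platform
        raise

runpod.serverless.start({"handler": handler})
```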
Unknown User•2y ago
Message Not Public
Yeah, it happens sometimes, just like with broken pods. I've had to terminate workers a few times.