Not all workers being utilized

In the attached image you can see 11/12 workers spun up, but only 7 are being utilized, yet we're being charged for all 12 GPUs. @girishkd
Unknown User 2y ago
(Message not public)
harishp (OP) 2y ago
If you look at the "Jobs" section, it shows 7 in progress. So it is not utilizing all the GPUs to serve requests; only 7 are serving them.
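(Note: one way to see this gap programmatically is the serverless health route. The sketch below is a minimal example assuming RunPod's `https://api.runpod.ai/v2/<ENDPOINT_ID>/health` route and a `RUNPOD_API_KEY` environment variable; the exact response fields may vary.)

```python
# Minimal sketch: compare worker counts vs. jobs in progress for a serverless endpoint.
# Assumes the /health route of the RunPod serverless API; field names may differ.
import os
import requests

ENDPOINT_ID = "YOUR_ENDPOINT_ID"        # placeholder endpoint ID
API_KEY = os.environ["RUNPOD_API_KEY"]  # assumed environment variable

resp = requests.get(
    f"https://api.runpod.ai/v2/{ENDPOINT_ID}/health",
    headers={"Authorization": f"Bearer {API_KEY}"},
    timeout=10,
)
resp.raise_for_status()
health = resp.json()

# Expected shape (hedged): {"workers": {"idle": ..., "running": ...}, "jobs": {"inProgress": ..., ...}}
print("workers:", health.get("workers"))
print("jobs:", health.get("jobs"))
```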
Unknown User 2y ago
(Message not public)
harishp (OP) 2y ago
It's just an SDXL model.
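(For context, a serverless SDXL worker typically boils down to something like the sketch below. This assumes the `runpod` Python SDK and Hugging Face `diffusers`; the model ID and input fields are illustrative, not the poster's actual setup.)

```python
# Minimal sketch of an SDXL serverless worker, assuming the runpod SDK and diffusers.
import runpod
import torch
from diffusers import StableDiffusionXLPipeline

# Load the pipeline once at startup so every job reuses the same GPU weights.
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",  # illustrative model ID
    torch_dtype=torch.float16,
).to("cuda")

def handler(job):
    # "prompt" is an illustrative input field, not the poster's actual schema.
    prompt = job["input"].get("prompt", "a photo of an astronaut")
    image = pipe(prompt=prompt, num_inference_steps=30).images[0]
    path = "/tmp/output.png"
    image.save(path)
    return {"image_path": path}

runpod.serverless.start({"handler": handler})
```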
girishkd 2y ago
On some of the GPUs a CUDA failure was seen, and when we remove those GPUs from the list of workers, they are not spinning up.
Unknown User 2y ago
(Message not public)
harishp (OP) 2y ago
We limited the CUDA versions to 12.1. @girishkd and I are colleagues.
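(A suggestion not from the thread: a quick CUDA sanity check at worker startup can make a broken GPU fail fast instead of sitting in the pool. The sketch below uses only standard PyTorch calls.)

```python
# Sketch: fail fast if the worker's CUDA stack is broken, instead of accepting jobs.
import sys
import torch

def cuda_sanity_check() -> None:
    if not torch.cuda.is_available():
        sys.exit("CUDA not available on this worker")
    device = torch.cuda.current_device()
    print("GPU:", torch.cuda.get_device_name(device))
    # A tiny matmul exercises the driver/runtime; a broken GPU usually raises here.
    x = torch.randn(64, 64, device="cuda")
    _ = (x @ x).sum().item()
    print("CUDA sanity check passed")

if __name__ == "__main__":
    cuda_sanity_check()
```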
Unknown User 2y ago
(Message not public)
harishp (OP) 2y ago
nope nope
Unknown User 2y ago
(Message not public)
digigoblin 2y ago
What kind of CUDA failure? Did it OOM from running out of VRAM? I've seen that happen on 24GB GPUs when you add upscaling.
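(For reference, a VRAM OOM surfaces in recent PyTorch as `torch.cuda.OutOfMemoryError`, a subclass of `RuntimeError`; the sketch below shows one hedged way to tell it apart from other CUDA failures inside a job handler. `run_job` is a hypothetical helper, not part of any SDK.)

```python
# Sketch: distinguish a VRAM OOM from other CUDA errors inside a job handler.
import torch

def run_job_safely(run_job, job):
    try:
        return run_job(job)
    except torch.cuda.OutOfMemoryError as err:
        # Out of VRAM: free cached blocks and report a retryable error.
        torch.cuda.empty_cache()
        return {"error": f"CUDA out of memory: {err}"}
    except RuntimeError as err:
        # Other CUDA failures (e.g. a broken device) usually surface as RuntimeError.
        return {"error": f"CUDA/runtime failure: {err}"}
```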
girishkd 2y ago
The attached screenshot contains the CUDA failure we are experiencing.
girishkd 2y ago
We are using the 24GB ones (4090s) only.
digigoblin 2y ago
Oh yeah, that error seems to be due to a broken worker.
girishkd 2y ago
Okay. These broken workers are not getting respawned on their own. What should we do in that case?
digigoblin 2y ago
Contact RunPod support via web chat or email.
Unknown User 2y ago
(Message not public)
digigoblin 2y ago
Yeah, it happens sometimes, just like broken pods. I had to terminate workers a few times.
