Serverless FAILING to add Workers

I created a queue-based endpoint and have 4 requests in the queue. It has been 30-40 minutes and Serverless has not allocated a single H100 worker for me. I have not restricted the endpoint to any data centers (regions).
34 Replies
Immar K (OP), 2w ago
Why is this happening? Here are my endpoint settings: (screenshot not preserved)
Immar K (OP), 2w ago
Nothing of the sort, just a plain GPU.
Immar K (OP), 2w ago
Also, I have sufficient balance.
Immar K (OP), 2w ago
I have refreshed it dozens of times 🙁
Immar K (OP), 2w ago
No network volume or region is specified.
Immar K (OP), 2w ago
Network tab: (screenshot not preserved)
David, 2w ago
@flash-singh No workers. Is this a problem with our setup configuration, or is it a RunPod capacity problem? Thanks.
Poddy, 2w ago
@Immar K
Escalated To Zendesk
The thread has been escalated to Zendesk!
Ticket ID: #24966
Immar K (OP), 2w ago
Yeah, we're 3 hours in and nothing yet 🙁 I have been trying a bunch of things but nothing has worked for me.
Also, what's more strange is this warning: "Currently 100% of your max workers are busy. Consider increasing your max workers to 7 to handle higher demand and improve performance."
For some reason it doesn't let me go beyond 5 max workers. If I enter 6 or 7, it resets the value to 5. Maybe some check on the frontend, not sure.
Immar K (OP), 2w ago
I have ~$100 in the account with an $80/hr limit.
Immar K (OP), 2w ago
I'm sorry, what does that mean? How is that possible? Your screenshot shows a single worker.
Immar K (OP), 2w ago
It shows 5/5, which is not true.
Immar K (OP), 2w ago
I see, it says you cannot have more than 5 max workers with a balance under $100. Regardless, it says 5/5 workers deployed, which I believe is not true since I don't see anything in the Workers tab.
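For anyone comparing the console counters with what the endpoint actually reports, here is a minimal sketch that summarizes a health snapshot. It assumes a payload shaped like the one RunPod's serverless endpoint health route returns; the field names (`jobs.inQueue`, `workers.idle`, `workers.running`, `workers.initializing`, `workers.throttled`, `workers.unhealthy`) and the function `diagnose_health` are assumptions for illustration, not something confirmed in this thread, so check them against the current API docs.

```python
from typing import Any, Dict, List


def diagnose_health(health: Dict[str, Any]) -> List[str]:
    """Return plain-text findings for an endpoint health snapshot.

    The key names read here are assumed; missing keys count as 0.
    """
    jobs = health.get("jobs", {})
    workers = health.get("workers", {})

    queued = int(jobs.get("inQueue", 0))
    usable = int(workers.get("idle", 0)) + int(workers.get("running", 0))

    findings: List[str] = []
    # The situation described in this thread: requests waiting, nothing serving them.
    if queued > 0 and usable == 0:
        findings.append(f"{queued} job(s) queued but no worker is idle or running")
    # Report any workers that exist but cannot take jobs right now.
    for state in ("initializing", "throttled", "unhealthy"):
        count = int(workers.get(state, 0))
        if count > 0:
            findings.append(f"{count} worker(s) {state}")
    return findings


if __name__ == "__main__":
    snapshot = {"jobs": {"inQueue": 4}, "workers": {"idle": 0, "running": 0}}
    for line in diagnose_health(snapshot):
        print(line)
```

An empty result only means the snapshot shows nothing obviously stuck; it does not say whether the cause is configuration or capacity.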
Immar K (OP), 2w ago
Sure, yes, it's getting too cluttered here. Ticket created.
Looking at another strange behaviour: an instance came up but ComfyUI failed to start (likely a hardware issue, since I am running the latest CUDA version):
    CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
    torch.AcceleratorError: CUDA error: CUDA-capable device(s) is/are busy or unavailable
    return torch.cuda.cudart().cudaMemGetInfo(device)
    File "/root/ComfyUI/venv/lib/python3.10/site-packages/torch/cuda/memory.py", line 838, in mem_get_info
    mem_total_cuda = torch.cuda.mem_get_info(dev)
I'll wait for support to get back to me.
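The trace above fails inside `torch.cuda.mem_get_info`, so one way to surface a bad device early is to run that same call as a pre-flight check before the application starts. The following is only a sketch under that assumption: `gpu_preflight` and `default_probe` are hypothetical helper names, and nothing in this thread confirms how the platform reacts when a worker exits early.

```python
from typing import Callable, Optional, Tuple


def default_probe() -> Tuple[int, int]:
    """Ask PyTorch for (free_bytes, total_bytes) on the current CUDA device."""
    import torch  # imported lazily so the helper can be tested without a GPU

    return torch.cuda.mem_get_info()


def gpu_preflight(probe: Callable[[], Tuple[int, int]] = default_probe) -> Optional[str]:
    """Return None if the GPU answers the probe, otherwise a short error text."""
    try:
        _free_bytes, total_bytes = probe()
    except Exception as exc:  # CUDA faults, a missing driver, or a missing torch install
        return f"{type(exc).__name__}: {exc}"
    if total_bytes <= 0:
        return "GPU reported zero total memory"
    return None


if __name__ == "__main__":
    problem = gpu_preflight()
    if problem is not None:
        # Fail fast with a clear message instead of crashing deep inside app startup.
        raise SystemExit(f"GPU pre-flight failed: {problem}")
    print("GPU pre-flight passed")
```

Passing the probe in as a parameter keeps the check testable on a machine without CUDA, and the failure message makes it easier to tell a device problem apart from an application bug in the worker logs.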
Immar K (OP), 2w ago
No, I don't have any filters. My template is not 12.9; I have run my application on older CUDA versions as well, on other platforms, etc.
Yueqi, 2w ago
Hi! I am getting the same issue with 80GB Pro GPUs. Even when I create new endpoints with 48GB GPUs, the model does not load and no generations run.
byteripper, 2w ago
Same issue here: my new (and only) endpoint has been in the "Initializing" state for the last 3 hours, without any workers.
byteripper, 2w ago
I recreated my endpoint and it seems to be working again now.
