Workers are getting throttled
Hey guys,
Workers are getting throttled. I have a 50-worker limit and most of them are getting throttled. My application is being impacted heavily.
As a note, it's mostly happening for US-based workers. I have no preferences around GPU or CUDA version, so it starts workers randomly across the globe.
Hey @Jaya do you have an endpoint id you can share?
I have many endpoints. One of them is
https://console.runpod.io/serverless/user/endpoint/2nbf2hwyiledzn
@Jaya what I see is that your GPU selection is set to the 4090 only, i.e. the 24 GB PRO tier. If you are okay with it and your application can handle it, I'd recommend also allowing a 24 GB GPU in your endpoint's selection.
4090s are quite popular and can get eaten up. You can also request a larger worker max if you feel that would help increase your workers across your endpoints 🙂
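For anyone who'd rather script this than click through the console, here's a rough sketch using the runpod Python SDK. The create_endpoint helper, its parameter names, and the GPU pool IDs below are assumptions on my part, so please verify them against the current SDK docs and the pool names shown in the console before relying on this:

```python
import os

import runpod

# Assumption: the runpod SDK exposes create_endpoint with these parameters;
# the GPU pool IDs ("ADA_24" for 4090-class, "AMPERE_24" for the plain 24 GB
# tier) are illustrative and should be checked against the console.
runpod.api_key = os.environ["RUNPOD_API_KEY"]

endpoint = runpod.create_endpoint(
    name="my-serverless-endpoint",
    template_id="YOUR_TEMPLATE_ID",   # template with your container image
    gpu_ids="ADA_24,AMPERE_24",       # 4090-class plus a 24 GB fallback tier
    workers_min=0,
    workers_max=75,                   # the raised cap from this thread
)
print(endpoint)
```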
OK, can you help me with a larger worker max? I will also add 24 GB as a secondary option, but honestly speaking this has only started in the last couple of days.
It wasn't this bad before.
@Jaya I increased your max workers to 75 to give you a bit more wiggle room on any critical endpoints, so feel free to bump that up. If you need even more workers, you can fill out a HubSpot form: in the UI, click on your max workers amount and it will take you to a form to request a further increase.
Thanks Justin
Can confirm it hasn't looked good for the last several days (it was fine before). I also use 4090s.

Hey 🖐
I'm experiencing the same issue. Almost all workers on 4090 are showing as throttled with Low Supply.
The 5090s are completely unavailable.
Just to confirm, this is without region restrictions?
Is this also with no region restrictions?
Yes. Region doesn't matter.
Yeah, 4090s and 5090s are popular GPUs and might just be heavily used right now. I've raised it with the team to see if there are any further concerns, but this can happen, especially when everyone concentrates on these two GPU types.
Thanks for the answer. Please help, because the GPUs aren't working at all.
Constantly throttled.
Can you share the endpoint with me? Is your entire endpoint throttled?
It should still have some workers left to use.
The entire endpoint is throttled.
I think the ID is not important. You can create an endpoint yourself right now and check: the 4090 is almost unavailable, and the 5090 is completely unavailable.
I see that workers are initialized, but then immediately become throttled.
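If it helps to confirm what you're seeing, you can poll the endpoint's health route and watch the worker counts over time. A minimal sketch with plain requests, assuming the standard https://api.runpod.ai/v2/<ENDPOINT_ID>/health route and that the response includes a "workers" block with per-state counts (field names may differ slightly, so check your own response):

```python
import os
import time

import requests

ENDPOINT_ID = "2nbf2hwyiledzn"            # endpoint ID shared earlier in the thread
API_KEY = os.environ["RUNPOD_API_KEY"]

# Assumption: /health returns JSON with a "workers" object that breaks
# workers down by state (e.g. idle / running / throttled).
url = f"https://api.runpod.ai/v2/{ENDPOINT_ID}/health"
headers = {"Authorization": f"Bearer {API_KEY}"}

for _ in range(10):                       # sample every 30s for ~5 minutes
    resp = requests.get(url, headers=headers, timeout=30)
    resp.raise_for_status()
    workers = resp.json().get("workers", {})
    print(f"{time.strftime('%H:%M:%S')} workers={workers}")
    time.sleep(30)
```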
Will test and raise to the team, thanks for reporting
Reported to the team. @PotapovS / @Xeverian / @Jaya this is being tracked and worked on, FYI.
Thanks for helping to report the issue
This does not seem to be improving @justin (New) [Staff Not Staff]. I also have 48 GB A40 and A6000 options, but none of them can get a worker placed either.

Maybe you can set the other 48 GB GPUs as higher priority in the menu for now. But yes, I've already raised this as a high priority; they identified the potential issue, are rolling some things back, and changes are currently underway.
It should be better today, FYI: the 4090 and 5090 issue seems to have been resolved last night. Thanks again to everyone for pointing out the issue.
I confirm. The problems have been resolved.
Thank you and the team for your help!
Now there is another issue. Requests are taking way too long to move from in queue to in progress.
And this time it's with H100s. It seems the problematic workers are the ones running in North America, which are unable to reserve GPUs.
It seems it's a CUDA error. I have no preference around CUDA versions. This has been causing a lot of trouble recently.
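To put numbers on the "stuck in queue" part, one option is to poll the job status and log how long each request sat in the queue before starting. A small sketch assuming the usual /run and /status/<job_id> routes, and that the status payload reports delayTime (time in queue) and executionTime in milliseconds (worth verifying those field names on your own responses; replace the input with whatever your handler expects):

```python
import os
import time

import requests

ENDPOINT_ID = "YOUR_ENDPOINT_ID"
API_KEY = os.environ["RUNPOD_API_KEY"]
BASE = f"https://api.runpod.ai/v2/{ENDPOINT_ID}"
HEADERS = {"Authorization": f"Bearer {API_KEY}"}

# Submit a job asynchronously; /run returns a job ID immediately.
job = requests.post(f"{BASE}/run", headers=HEADERS,
                    json={"input": {"prompt": "ping"}}, timeout=30).json()
job_id = job["id"]

# Poll until the job reaches a terminal state.
while True:
    status = requests.get(f"{BASE}/status/{job_id}", headers=HEADERS,
                          timeout=30).json()
    if status["status"] in ("COMPLETED", "FAILED", "CANCELLED", "TIMED_OUT"):
        break
    time.sleep(5)

# Assumption: delayTime / executionTime are reported in milliseconds.
print("status:", status["status"])
print("queue delay (s):", status.get("delayTime", 0) / 1000)
print("execution (s):", status.get("executionTime", 0) / 1000)
```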
Sorry to hear, feel free to create a support ticket:
https://contact.runpod.io/hc/en-us/requests/new
It will also be escalated more easily to the right team / points of contact who can look deeper into it.
Hey 🖐
The throttled workers problem is back.
All regions, all CUDA versions. The 5090 constantly falls off and gets throttled.
I can confirm it's back with the 4090 as well.
5090 completely unusable
Hey 🖐
The throttled workers problem is back again.
All regions, all CUDA versions. The 5090 constantly falls off and gets throttled.
Thank you, will be taking a look and raising to the team
Thanks, have confirmed - there is extremely high usage right now from someone eating up GPUs. Flagged to the team.
5090's are super scuffed rn
Yes; to give an update, we just have extremely high utilization right now from all customers maxing out the data centers. The team is planning to increase capacity in the upcoming 1-2 weeks, since it takes time to get physical hardware online.
Not a single one of my endpoints have a 5090 available 🫠🫠
I want to comment on the throttling issue. For us it started a few days ago, and it always happens around 11am EST. It feels like someone starts running something big and pushes everyone else out. We tried data centers in Iceland and Romania; it's all the same.
This is back again in the last few days. Most of the workers are getting killed for the 4090. Sometimes it also happens with H100s.
This is highly disappointing.
It seems you can't run a business on the assumption that you will get serverless GPUs on RunPod.
Throughout this week we've been running emergency maintenance and the users most affected are those running serverless workloads with popular/low cost GPUs. Where we may have a surplus of a specific GPU, we have to delist those machines to perform work on them. We are obligated to perform this maintenance across the fleet and only ask for your patience until it's done and we can disclose the reason.
Your reason is absolutely justified, but you have to consider the fact that other businesses rely on the GPUs you provide.
: (

😢

and it's the 4090 in EU-RO-1, the best region for those cards