Serverless throttled
Hi! Since yesterday I can't run my serverless endpoint - I'm constantly being throttled or given unhealthy workers. Can we do something to make it work?
This is usually supply constraints. Can you share your endpoint ID? I'll take a look.
https://console.runpod.io/serverless/user/endpoint/zzzxbb6p6pogvm
I’m facing significant supply constraints right now, so that’s probably the issue.
Hey guys, hi! I'm having ongoing throttling issues with several serverless endpoints in RunPod (thfv8pa98n0zmx, 3uo2k0k7717auu, 9o42o47k1v1wn). They've been stuck for two days now and it's disrupting work. Which section/channel should I post a detailed support request in to get a quick response?
Thank you all, the data helps significantly.
We are restoring a small percentage of servers that were scheduled for maintenance to recover lost capacity.
Hey! I have the same issue. Even though I have active workers enabled that my production service was relying on, they have been throttled and my service has become unreliable. I'm losing clients, guys.
Here's how it looks, but I have 2 active workers enabled.
But RunPod doesn't give them to me, and I saw it sitting at 0 workers. I think that's horrible.
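If it helps anyone here, you can confirm for yourself how many of your workers are throttled or unhealthy by polling the endpoint's serverless health route. Rough sketch below, assuming the `https://api.runpod.ai/v2/<endpoint_id>/health` route; the exact worker-state field names (`throttled`, `unhealthy`, etc.) are my assumption and may differ, and `YOUR_ENDPOINT_ID` / `RUNPOD_API_KEY` are placeholders for your own values.

```python
import os

import requests

# Placeholders - substitute your own endpoint ID and API key.
ENDPOINT_ID = "YOUR_ENDPOINT_ID"
API_KEY = os.environ["RUNPOD_API_KEY"]

# Query the serverless health route for the endpoint.
resp = requests.get(
    f"https://api.runpod.ai/v2/{ENDPOINT_ID}/health",
    headers={"Authorization": f"Bearer {API_KEY}"},
    timeout=10,
)
resp.raise_for_status()
health = resp.json()

# Worker counts by state; the field names here are assumptions and may vary.
workers = health.get("workers", {})
for state in ("throttled", "unhealthy", "idle", "running"):
    print(f"{state}: {workers.get(state, 0)}")
```

Polling this periodically makes it easier to spot when the throttled count spikes instead of waiting for jobs to pile up in the queue.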


RTX 4090 is completely unstable at the moment
Solution
I believe I've spoken to all of you in a mixture of other threads and in the general channel - but sharing this for visibility:
Throughout this week we've been running emergency maintenance, and the users most affected are those running serverless workloads on popular GPUs. Even where we may have a surplus of a specific GPU, we have to delist the machines that host those GPUs (up to 8 GPUs per machine) to perform work on them.
We are obligated to perform this maintenance across the fleet and only ask for your patience until it's done and we can disclose the reason.
When will it be finished?
We're doing small subsets of machines daily until Friday.
Why didn't we get informed beforehand, instead of only when the shit hit the fan?
That's odd behaviour tbh, and not how clients should be treated, whether you're obligated to or not. Don't you also have an obligation to us, your paying customers? 😄
We were only able to inform users with persistent Pods, due to the nature of our delisting process: when we mark a machine for maintenance, it emails anyone running a Pod on that host. For example, even my own customer account, which only runs serverless workloads, didn't get an email about this.
While I also want this process reformed, our hands are tied due to the nature of the problem. I really can't tell you the reason until we're done.
Thank you for your transparency. Yes, same here: only serverless, and this created lots of issues throughout today.
OK, thank you for the information. Even though I would have loved to be notified rather than figuring it out on my own (we had half a day of near downtime before migrating :/), I see that it wasn't possible in this case. Good to hear that you're working on improving communications.
It's a little last minute for us too, and any process improvements I suggest just can't be applied yet because of the timeline we're obligated to here.
We'll look to making this less uncomfortable in the future, we have a few ideas already.
Same for me, and I can't select another GPU because availability on the others is low too.

So will the problem still be relevant tomorrow?
All of our scheduled maintenance for tomorrow is postponed.
Hi.
How long are they being postponed?
When will at least minimal capacity for the 4090 be restored?
Hello, I've also been running into similar issues, so I've added more GPU options (32, 80, and 141 GB); however, my rollout has been stuck in this state for the past 3-4 hours. The endpoint is unusable. Any way to get it rolled out?

When can we expect a solution to this problem, and what should we do if our business relies on daily use of serverless?
I thought we were going to be stable after Friday.
Is that no longer true? If so, it seems we need to move to another provider.
Can you recommend good providers?
It's already Saturday, but service still hasn't returned to normal. How much longer do we have to wait?