Serverless throttled

Hi! Since yesterday I can't run my serverless endpoint - I'm constantly being throttled or given unhealthy workers. Can we do something to make it work?
Dj (7d ago)
This is usually supply constraints. Can you share your endpoint ID? I'll help look.
Ginterhauser (OP, 7d ago)
https://console.runpod.io/serverless/user/endpoint/zzzxbb6p6pogvm?
sharif (7d ago)
I’m facing significant supply constraints right now, so that’s probably the issue.
palladinEA (7d ago)
Hi! I'm having ongoing throttling issues with several serverless endpoints on Runpod (thfv8pa98n0zmx, 3uo2k0k7717auu, 9o42o47k1v1wn). They've been stuck for two days now and it's disrupting our work. Which section/channel should I post a detailed support request in to get a quick response?
Dj (7d ago)
Thank you all; the data helps significantly. We are restoring a small percentage of the servers that were scheduled for maintenance to recover the lost capacity.
Genia (7d ago)
Hey! I have the same issue. Even though I have active workers enabled that my production service relies on, they've been throttled and my service has become unreliable. I'm losing clients. Here's how it looks: I have 2 active workers enabled, but Runpod doesn't give them to me, and I've seen the endpoint at 0 workers. It's horrible.
(screenshots attached)
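For anyone who wants to catch this state before their customers do, here is a minimal client-side monitoring sketch. It assumes RunPod's serverless GET /health route with Bearer auth; API_KEY, ENDPOINT_ID, and the exact keys inside the "workers" object are placeholders/assumptions, so print the raw response to confirm the shape on your own account.

```python
import time
import requests

API_KEY = "YOUR_API_KEY"        # hypothetical placeholder
ENDPOINT_ID = "YOUR_ENDPOINT"   # hypothetical placeholder

def worker_counts() -> dict:
    """Fetch worker stats for a serverless endpoint (assumed /health route)."""
    resp = requests.get(
        f"https://api.runpod.ai/v2/{ENDPOINT_ID}/health",
        headers={"Authorization": f"Bearer {API_KEY}"},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json().get("workers", {})

if __name__ == "__main__":
    while True:
        workers = worker_counts()
        # "running" is an assumed field name; inspect `workers` if it differs.
        if workers.get("running", 0) == 0:
            print("ALERT: endpoint reports 0 running workers:", workers)
        time.sleep(60)  # poll once a minute
```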
Milad (7d ago)
RTX 4090 is completely unstable at the moment
Solution
Dj (6d ago)
I believe I've spoken to all of you in a mixture of other threads and in the general channel, but I'm sharing this for visibility: throughout this week we've been running emergency maintenance, and the users most affected are those running serverless workloads on popular GPUs. Even where we have a surplus of a specific GPU, we have to delist the machines that host those GPUs (up to 8 GPUs per machine) to perform the work on them. We are obligated to perform this maintenance across the fleet, and we only ask for your patience until it's done and we can disclose the reason.
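A client-side stopgap while capacity is being restored is to make callers tolerate queue time instead of failing fast. Below is a minimal sketch, assuming the standard serverless /run and /status/{id} routes and the "COMPLETED"/"FAILED" status strings; the endpoint ID and key are placeholders.

```python
import time
import requests

API_KEY = "YOUR_API_KEY"        # hypothetical placeholder
ENDPOINT_ID = "YOUR_ENDPOINT"   # hypothetical placeholder
BASE = f"https://api.runpod.ai/v2/{ENDPOINT_ID}"
HEADERS = {"Authorization": f"Bearer {API_KEY}"}

def run_with_backoff(payload: dict, max_wait: int = 600):
    """Submit a job, then poll with exponential backoff while workers are scarce."""
    resp = requests.post(f"{BASE}/run", json={"input": payload},
                         headers=HEADERS, timeout=30)
    resp.raise_for_status()
    job_id = resp.json()["id"]

    delay, waited = 2, 0
    while waited < max_wait:
        status = requests.get(f"{BASE}/status/{job_id}",
                              headers=HEADERS, timeout=30).json()
        if status.get("status") == "COMPLETED":
            return status.get("output")
        if status.get("status") == "FAILED":
            raise RuntimeError(f"job {job_id} failed: {status}")
        # Still queued or running; back off instead of hammering the API.
        time.sleep(delay)
        waited += delay
        delay = min(delay * 2, 60)  # cap the backoff at 60s
    raise TimeoutError(f"job {job_id} not finished after {max_wait}s")
```

This doesn't create capacity, but it keeps a throttled queue from turning into hard failures upstream.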
Evgeniy_Wis (6d ago)
When will it be finished?
Dj (6d ago)
We're doing small subsets of machines daily until Friday.
Mandulis - Flix
Why didn't we get informed beforehand, instead of only when the shit hits the fan? That's odd behaviour tbh, and not how clients should be treated, whether you're obligated to or not. Don't you also have an obligation to us, your paying customers? 😄
Dj (6d ago)
We were only able to inform users with persistent Pods, due to the nature of our delisting process: when we mark a machine for maintenance, it emails anyone running a Pod on that host. Even my own customer account, which only runs serverless workloads, didn't get an email about this. While I also want reform for this process, our hands are tied by the nature of the problem. I truly can't tell you the reason until we're done.
Mandulis - Flix
Thank you for your transparency. Yes, same here: serverless only, and this created lots of issues throughout today.
Ginterhauser (OP, 6d ago)
OK, thank you for the information. Even though I would have loved to be notified rather than figuring it out on my own (we had half a day of near-downtime before migrating :/), I see that it wasn't possible in this case. Good to hear that you're working on improving communications.
Dj (6d ago)
It's a little last minute for us too, and any process improvements I suggest can't be applied yet because of the timeline we're obligated to here. We'll look into making this less uncomfortable in the future; we have a few ideas already.
QuantumWizard (6d ago)
Same for me, and I can't select another GPU because availability on the others is low.
(screenshot attached)
Evgeniy_Wis (6d ago)
So will the problem still be there tomorrow?
Dj (6d ago)
All of our scheduled maintenance for tomorrow is postponed.
PotapovS (6d ago)
Hi. How long is the maintenance being postponed? When will at least minimal 4090 capacity be restored?
Hugo (6d ago)
Hello, I've also been running into similar issues, so I've added more GPU options (32, 80, and 141 GB); however, my rollout has been stuck in this state for the past 3-4 hours and the endpoint is unusable. Any way to get it rolled out?
(screenshot attached)
Evgeniy_Wis (6d ago)
When can we expect a solution to this problem, and what should we do if our business relies on daily use of serverless?
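One pragmatic pattern if your business depends on serverless daily: keep a fallback endpoint on a less popular GPU type (or another region/provider) and fail over client-side at submit time. A minimal sketch follows; both endpoint IDs are hypothetical, and jobs that get queued but throttled still need a status-poll timeout like the backoff snippet above.

```python
import requests

API_KEY = "YOUR_API_KEY"  # hypothetical placeholder
# Hypothetical IDs: primary on the constrained GPU, fallback elsewhere.
ENDPOINTS = ["primary_endpoint_id", "fallback_endpoint_id"]

def submit(payload: dict):
    """Try each endpoint in order; return (endpoint_id, job_id) on success."""
    last_err = None
    for endpoint_id in ENDPOINTS:
        try:
            resp = requests.post(
                f"https://api.runpod.ai/v2/{endpoint_id}/run",
                json={"input": payload},
                headers={"Authorization": f"Bearer {API_KEY}"},
                timeout=30,
            )
            resp.raise_for_status()
            return endpoint_id, resp.json()["id"]
        except requests.RequestException as err:
            last_err = err  # submit failed here; try the next endpoint
    raise RuntimeError(f"all endpoints failed: {last_err}")
```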
Genia (5d ago)
I thought we were going to be stable after Friday. Is that no longer true? If so, it seems we need to move to another provider.
Evgeniy_Wis (5d ago)
Can you recommend good providers?
Kriskras2 (3d ago)
It's already Saturday, but work still hasn't returned to normal. How much longer do we have to wait?
