Throttling on multiple endpoints and failed workers
All of our endpoints with RTX 4090 workers are fully throttled, some with over 100 workers. There is no incident report or any update here or on the status page. Workers consistently come up and get stuck loading the image, and to top it off they sit in the executing state and charge the account.
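In the meantime we've been watching for this from our side. Here's a minimal sketch, assuming the serverless health endpoint (GET https://api.runpod.ai/v2/{ENDPOINT_ID}/health) reports worker counts by state and that a `policy.executionTimeout` field on /run caps how long a stuck job can bill; the field names and the "throttled" state key are my reading of the docs, not confirmed in this thread:

```python
import os
import time

import requests

API_KEY = os.environ["RUNPOD_API_KEY"]          # your RunPod API key
ENDPOINT_ID = os.environ["RUNPOD_ENDPOINT_ID"]  # placeholder for your endpoint ID
BASE = f"https://api.runpod.ai/v2/{ENDPOINT_ID}"
HEADERS = {"Authorization": f"Bearer {API_KEY}"}


def worker_states() -> dict:
    """Poll the serverless health endpoint and return worker counts by state."""
    resp = requests.get(f"{BASE}/health", headers=HEADERS, timeout=10)
    resp.raise_for_status()
    return resp.json().get("workers", {})


def submit_job(payload: dict) -> str:
    """Submit a job with an execution timeout so a stuck worker can't bill forever.

    The 60_000 ms cap is an arbitrary example value, and 'policy.executionTimeout'
    is an assumption about the request schema.
    """
    body = {"input": payload, "policy": {"executionTimeout": 60_000}}
    resp = requests.post(f"{BASE}/run", headers=HEADERS, json=body, timeout=10)
    resp.raise_for_status()
    return resp.json()["id"]


if __name__ == "__main__":
    # Log worker states once a minute so throttling shows up in our own
    # metrics instead of only on the RunPod dashboard.
    while True:
        states = worker_states()
        throttled = states.get("throttled", 0)
        if throttled:
            print(f"WARNING: {throttled} throttled workers: {states}")
        else:
            print(f"worker states: {states}")
        time.sleep(60)
```

It won't unstick anything, but at least the timeout bounds the charges and the log gives you a timeline to attach to a support ticket.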
10 Replies
+1, been experiencing this a lot today
Same here
Seems like RunPod has supply and demand issues
Throughout this week we've been running emergency maintenance, and the users most affected are those running serverless workloads on popular GPUs. Even where we would otherwise have a surplus of a specific GPU, we have to delist the machines that host them (up to 8 GPUs per machine) to perform work on them.
We are obligated to carry out this maintenance across the fleet, and we only ask for your patience until it's done and we can disclose the reason.
Just started using serverless today... is this normal?
No, just caught us at a bad time - sorry.
Based on my experience, RunPod does not appear to be production-ready. Each time I've attempted to use it or deploy a workload, I've encountered issues, with no documented incident reports and unanswered emails. That calls into question the claim of SOC 2 Type II compliance. In the past I also reported significant slowness ("delay time") for which I was billed; the root cause was never identified and the issue remained unresolved. Sad...

The email linked to the pods in this screenshot has no support tickets. Can you send me a message with your ticket IDs or the email you used to contact us?