RunPod4mo ago
ashleyk

Unacceptably high failed jobs suddenly

Suddenly almost 20% of my serverless jobs failed. I have never had this issue until yesterday. It is completely UNACCEPTABLE that I am being charged for this immense fuck up and that my customers are being impacted. This needs to be resolved IMMEDIATELY and I demand a refund for this!
26 Replies
ashleyk
ashleyk4mo ago
@flash-singh @Zeen @JM This is completely UNACCEPTABLE and needs to be RESOLVED IMMEDIATELY and I demand a refund. It must be some infrastructure issue because I don't even have any error logs for any of my failed jobs. It is also extremely suspicious that the increase in failed jobs coincides with fewer workers being throttled.
ashleyk
ashleyk4mo ago
No description
Baran
Baran4mo ago
Same here. Not 20%, but still way more than before.
ashleyk
ashleyk4mo ago
Jobs should only fail if there is an error in the serverless handler code, which never happened in my case. I also don't know how this issue is supposed to be debugged when there are no error logs for any of the failed jobs. Looks like most of them failed due to executionTimeout exceeded. My jobs shouldn't take more than 5 minutes to execute, so there is something wrong with the workers. My jobs take 3 minutes max to execute, so something is seriously wrong here. I have been running this endpoint in various different regions for several months and never had this issue until now. I would also expect to see these executionTimeout errors in the logs for my endpoint, but they aren't in the logs.
ashleyk
ashleyk4mo ago
Also, I don't know why my IP was rate limited on Saturday. This has never happened before and I wasn't even sending that many requests.
No description
ashleyk
ashleyk4mo ago
This is making serverless more and more unusable by the day. Each of those 2 terminal windows has a different public IP as well, so there is really no reason why I should have been rate limited.
flash-singh
flash-singh4mo ago
is that 429 error in your handler code or somewhere else?
ashleyk
ashleyk4mo ago
Checking the status of my jobs mostly. Also trying to create a new job.
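For status polling that occasionally hits a 429, one common client-side mitigation is to retry with exponential backoff and jitter. This is only a minimal sketch under assumptions: `request` is a hypothetical zero-argument callable that wraps whatever HTTP call is being made and returns `(status_code, body)`, and the attempt count and delays are illustrative values, not RunPod's documented limits.

```python
import random
import time


def call_with_backoff(request, max_attempts=5, base_delay=1.0,
                      max_delay=30.0, sleep=time.sleep):
    """Retry request() while it returns HTTP 429, backing off between tries.

    request: hypothetical callable returning (status_code, body).
    Returns the last (status_code, body) seen.
    """
    for attempt in range(max_attempts):
        status, body = request()
        if status != 429:
            # Anything other than "rate limited" is handed back to the caller.
            return status, body
        if attempt < max_attempts - 1:
            # Exponential ceiling (1s, 2s, 4s, ...) capped at max_delay,
            # with full jitter so parallel pollers do not retry in lockstep.
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, ceiling))
    # Still rate limited after the final attempt.
    return status, body
```

The `sleep` parameter is injectable only so the helper can be exercised without real waiting.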
flash-singh
flash-singh4mo ago
why not use webhooks?
ashleyk
ashleyk4mo ago
Webhooks can be unreliable
flash-singh
flash-singh4mo ago
does this endpoint get a lot of volume?
ashleyk
ashleyk4mo ago
It fluctuates: weekends are busier than during the week, and evenings are busier than the day time (my day time, because most of the customers are in the US). But there were a lot more requests than usual over the weekend. Not anything massive though, it's on average < 1000 requests per day.
ashleyk
ashleyk4mo ago
And we haven't even done 300 jobs today yet, but there have already been 30 failed jobs, which is not normal.
No description
ashleyk
ashleyk4mo ago
C = Completed
F = Failed
R = Retried
And the graph showing the spike in failed jobs is above. The table is from the metrics API, and the graph above is from the health API. Also, I can't use a webhook because it's all running on an internal VPC on AWS which is not publicly accessible. And if I use a webhook I can't check whether my jobs are stuck on IN_PROGRESS for too long and automatically cancel them.
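The polling approach described here (watch the status, cancel anything stuck on IN_PROGRESS) could be sketched roughly as below. Everything in it is an assumption for illustration: `get_status` and `cancel` are hypothetical callables wrapping the endpoint's status and cancel requests, the status names other than IN_PROGRESS and TIMED_OUT are assumed, and the 300-second limit simply mirrors the "no more than 5 minutes" expectation mentioned earlier in the thread.

```python
import time

# Assumed set of terminal job states; only TIMED_OUT is named in this thread.
TERMINAL = {"COMPLETED", "FAILED", "CANCELLED", "TIMED_OUT"}


def watch_job(job_id, get_status, cancel, max_in_progress_s=300,
              poll_s=5, now=time.monotonic, sleep=time.sleep):
    """Poll a job until it finishes, cancelling it if it runs too long.

    get_status(job_id) -> status string (hypothetical wrapper).
    cancel(job_id)     -> requests cancellation (hypothetical wrapper).
    Returns the terminal status, or "CANCELLED_BY_CLIENT" if we gave up.
    """
    started = None  # time at which IN_PROGRESS was first observed
    while True:
        status = get_status(job_id)
        if status in TERMINAL:
            return status
        if status == "IN_PROGRESS":
            if started is None:
                started = now()
            elif now() - started > max_in_progress_s:
                # Stuck longer than any healthy run should take.
                cancel(job_id)
                return "CANCELLED_BY_CLIENT"
        sleep(poll_s)
```

Keeping TIMED_OUT separate from FAILED in the returned value lets the caller count worker-side timeouts apart from handler errors.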
flash-singh
flash-singh4mo ago
may have to change failed to something else when it times out, can't tell if it's that or a fail at job level
ashleyk
ashleyk4mo ago
There is already TIMED_OUT, can't it just use that?
flash-singh
flash-singh4mo ago
yes that's the plan, in my fix i made it failed, should change it back
ashleyk
ashleyk4mo ago
Oh yeah, I think it's better to change it back 👍
JM
JM4mo ago
Hey @ashleyk Hit me up with your endpoint ID; will provide you credits 👍
octopus
octopus4mo ago
Gotta give @ashleyk a job at this point, he helps everyone
ashleyk
ashleyk4mo ago
Hi @JM, endpoint ID is sdj01thu7r2mxx. There were issues on 24th, 25th and 26th Jan, where my billing escalated more than usual, and the execution time spiked slightly on 24th and 25th but massively on 26th when there were so many failed jobs. Things seem to have stabilised from 26th Feb onwards.
No description
ashleyk
ashleyk4mo ago
By the way, there is a large gap because I had to switch my endpoint to a different region because all my workers were throttled.
JM
JM4mo ago
Uh, that's no good, thanks for explaining. Btw, I was literally buried in work, I found more hardware for everyone. Apologies for the delay in responding @ashleyk. Credited the account! Thanks for helping everyone.
ashleyk
ashleyk4mo ago
That's awesome news that you found new hardware @JM, it will make us all very happy, thank you! 🙏 No worries about the delay in responding and thanks very much for the credits, it's cool that you can do it directly now and don't need to issue credit codes anymore. Helping people is only a pleasure. 🫶
JM
JM4mo ago
Yep, engineering has been helping me and Justin a lot lately; new admin features like this one always help so much! Take care sir, let me know if you need anything. Need to go to bed now.
ashleyk
ashleyk4mo ago
Awesome news, thanks very much, you take care too, and have a good night's rest 🙏