Runpod•2y ago

Issues in SE region causing a massive amount of jobs to be retried

The issues in the screenshot are causing 10% of my jobs to be retried in the SE region. Please fix this; it's not happening in the CA region.

20 Replies

digigoblinOP•2y ago

Obviously I am referring to the "Connection timeout" errors, which cause the job results to fail to be returned, and not the single exception among them.

Madiator2011•2y ago

@digigoblin DO YOU MIND SUBMITTING A TICKET ON THE WEBSITE? IT'S EASIER TO ESCALATE.

digigoblinOP•2y ago

No need to shout but sure 😁

Madiator2011•2y ago

Oops, sorry for the caps.

digigoblinOP•2y ago

Ticket number is 4208

Madiator2011•2y ago

done

digigoblinOP•2y ago

Thank you

Unknown User•2y ago

Message Not Public

digigoblinOP•2y ago

You probably didn't try and send 1000 jobs today

Unknown User•2y ago

Message Not Public

digigoblinOP•2y ago

I said 10% are retried NOT ALL 🤦‍♂️

Unknown User•2y ago

Message Not Public

digigoblinOP•2y ago

They are retried; they don't fail.

Unknown User•2y ago

Message Not Public

digigoblinOP•2y ago

RunPod needs to check it out. I switched to CA in the meantime and it works fine without any issues.

Unknown User•2y ago

Message Not Public

digigoblinOP•2y ago

I was using CA but then switched to SE because my jobs were failing. It turned out the failures were caused by my own Redis server running out of memory (OOM), not by RunPod. So I upgraded my ElastiCache instance on AWS from cache.t3.medium to cache.m4.large, and now it's fine.
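
For anyone debugging a similar case, here is a minimal sketch of how Redis memory pressure could be checked before blaming the region. It assumes you already have the `used_memory` and `maxmemory` fields from the Redis `INFO memory` output as a mapping. The function name `memory_headroom` and the 0.8 warning threshold are illustrative assumptions, not anything from RunPod or ElastiCache.

```python
def memory_headroom(info: dict, warn_ratio: float = 0.8) -> dict:
    """Summarise memory pressure from a Redis INFO-style mapping.

    `info` is expected to carry `used_memory` and, optionally, `maxmemory`
    (both in bytes), as reported by the Redis INFO memory section.
    `warn_ratio` is an arbitrary illustrative threshold.
    """
    used = int(info["used_memory"])
    limit = int(info.get("maxmemory", 0))
    if limit <= 0:
        # A maxmemory of 0 means no explicit limit is configured,
        # so no usage ratio can be computed from these two fields.
        return {"used_bytes": used, "limit_bytes": None, "ratio": None, "warn": False}
    ratio = used / limit
    return {
        "used_bytes": used,
        "limit_bytes": limit,
        "ratio": ratio,
        "warn": ratio >= warn_ratio,
    }
```

With the redis-py client, a mapping of this shape can be obtained from `client.info("memory")`. On a cluster, each node would need to be checked separately.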

Unknown User•2y ago

Message Not Public

digigoblinOP•2y ago

Because it's a cluster, not a single instance.

Unknown User•2y ago

Message Not Public
