Reporting/blacklisting poorly performing workers

I've noticed that every now and then a bad worker is spawned for my endpoint and takes far longer to complete the job than other workers running the same job. Typically my job takes ~40s, but occasionally a worker with the same GPU takes 70s instead. I want to blacklist these pods from running my endpoint so performance isn't impacted.
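(Note for anyone hitting the same thing: there is no blacklist API as of this thread, but you can at least measure which workers are slow so you know what to report. Below is a minimal Python sketch against the serverless HTTP API. It assumes the /status response for a finished job exposes workerId and executionTime fields; the API key and slow-job threshold are placeholders.)

```python
"""Rough sketch: track per-worker execution times for a RunPod serverless
endpoint so slow workers can be spotted and reported/removed by hand.

Assumptions: the /status payload for a finished job includes "workerId"
and "executionTime" (ms); API_KEY and SLOW_THRESHOLD_MS are placeholders.
"""
import time
from collections import defaultdict

import requests

API_KEY = "YOUR_RUNPOD_API_KEY"   # placeholder
ENDPOINT_ID = "8ba6bkaiosbww6"    # endpoint mentioned in this thread
SLOW_THRESHOLD_MS = 60_000        # jobs normally finish in ~40s

BASE = f"https://api.runpod.ai/v2/{ENDPOINT_ID}"
HEADERS = {"Authorization": f"Bearer {API_KEY}"}

worker_times = defaultdict(list)  # workerId -> [executionTime, ...]


def run_and_record(payload: dict) -> dict:
    """Submit a job, wait for it to finish, and record which worker ran it."""
    job = requests.post(f"{BASE}/run", json={"input": payload}, headers=HEADERS).json()
    job_id = job["id"]

    while True:
        status = requests.get(f"{BASE}/status/{job_id}", headers=HEADERS).json()
        if status["status"] in ("COMPLETED", "FAILED", "TIMED_OUT", "CANCELLED"):
            break
        time.sleep(2)

    worker_id = status.get("workerId", "unknown")  # assumed field
    exec_ms = status.get("executionTime", 0)       # assumed field, in ms
    worker_times[worker_id].append(exec_ms)

    if exec_ms > SLOW_THRESHOLD_MS:
        # Nothing automatic can be done yet; log it so the worker can be
        # reported to RunPod support or removed from the endpoint by hand.
        print(f"slow worker {worker_id}: {exec_ms / 1000:.1f}s")
    return status
```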
20 Replies
Unknown User · 14mo ago
(Message not public)
Poddy · 14mo ago
@1AndOnlyPika
Escalated To Zendesk
The thread has been escalated to Zendesk!
1AndOnlyPika (OP) · 14mo ago
Yes, endpoint 8ba6bkaiosbww6. The delay times are also very inconsistent: some jobs take 20s to start even with FlashBoot on, while my regular cold start time is 5s.
1AndOnlyPika (OP) · 14mo ago
The same job, on the same machine, in US-OR-1.
[image attachment]
wuxmes · 14mo ago
Same here. I would love to be able to specify in the request that it should not be directed to a particular worker ID. I have a retry mechanism for when executionTimeout happens, but most of the time the retried job goes back to the same worker ID :|
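(A rough sketch of that retry idea, for reference: since the request can't exclude a worker, about the best a client can do is notice when the retry lands on the same worker ID and stop retrying or alert. This assumes a helper like run_and_record() above that returns the final /status payload, and that the payload includes a workerId field.)

```python
"""Sketch: re-submit on failure/timeout, but give up once the retry lands on
the same worker ID again, since a request currently can't exclude a worker.

Assumes run_and_record() from the earlier sketch and a "workerId" field in
the returned status payload (an assumption, not a documented guarantee).
"""

MAX_RETRIES = 3


def run_with_retry(payload: dict) -> dict:
    seen_workers = set()
    last_status: dict = {}
    for _ in range(MAX_RETRIES):
        last_status = run_and_record(payload)
        worker_id = last_status.get("workerId", "unknown")

        if last_status["status"] == "COMPLETED":
            return last_status

        if worker_id in seen_workers:
            # The retry went back to the same (likely bad) worker; nothing
            # more can be done client-side, so stop and flag it for a report.
            print(f"retry landed on the same worker {worker_id}, giving up")
            break
        seen_workers.add(worker_id)
    return last_status
```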
1AndOnlyPika (OP) · 14mo ago
I've found that the US-OR-1 location always has issues, whether it's a slower worker or workers with broken GPUs that won't even start the container. I'm going to remove it from the allowed locations for the time being.
1AndOnlyPika (OP) · 14mo ago
[image attachment]
1AndOnlyPika (OP) · 14mo ago
:dead:
1AndOnlyPika (OP) · 13mo ago
Hey RunPod, is there maybe a way to report these broken servers and have them fixed?
[image attachment]
Ethan Blake · 13mo ago
I have the same issue. I really need an API for blacklisting these bad workers.
DannyB · 13mo ago
I am also experiencing this issue
Ethan Blake · 13mo ago
I made a ticket on RunPod's website and told them that several of our companies really need this problem solved. If they reply, I will share the response with the group.
1AndOnlyPika (OP) · 13mo ago
Yeah, there should at least be some simple way for us to report a worker and have somebody investigate and fix it.
Unknown User · 13mo ago
(Message not public)
1AndOnlyPika (OP) · 13mo ago
Well, not very often, maybe 10% of the time, but it's still annoying to deal with.
Unknown User · 13mo ago
(Message not public)
One@DRT · 13mo ago
Not having the same issue, but the request is almost the same. Sometimes a worker fails because of internal errors or misconfiguration, or runs out of space because of memory-purging errors, etc. (this happens in less than 1% of cases). Unfortunately, when such a worker is in the batch, every task routed to it fails, so it's always a manual job to kick that worker out of the endpoint to stop the errors. We definitely need an API to kick unhealthy workers from endpoints!
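(A small client-side watchdog along these lines is sketched below: count failures per worker ID and raise an alert once one worker keeps failing, since the actual removal still has to be done by hand in the console or via a support ticket. The workerId field and the alert threshold are assumptions/placeholders, and the "alert" is just a print.)

```python
"""Sketch: flag workers that repeatedly fail jobs so they can be kicked
out of the endpoint by hand or reported to RunPod support.

Assumes job status dicts with "status" and an assumed "workerId" field,
e.g. the payloads returned by run_and_record() in the earlier sketch.
"""
from collections import Counter

FAILURE_ALERT_THRESHOLD = 3  # placeholder: failures before alerting

failures_per_worker: Counter = Counter()


def watch(status: dict) -> None:
    """Feed each finished job status into this to track unhealthy workers."""
    if status.get("status") not in ("FAILED", "TIMED_OUT"):
        return
    worker_id = status.get("workerId", "unknown")
    failures_per_worker[worker_id] += 1
    if failures_per_worker[worker_id] >= FAILURE_ALERT_THRESHOLD:
        # The worker is very likely unhealthy; today the fix is manual:
        # remove it from the endpoint in the console or open a ticket.
        print(f"worker {worker_id} has failed "
              f"{failures_per_worker[worker_id]} jobs; consider reporting it")
```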
1AndOnlyPika (OP) · 12mo ago
We need a way to get someone on the team looking into the specific worker; otherwise it may be reallocated even after being deleted.
LordOdin · 7mo ago
I've been using hundreds of pods lately, and we run into this issue almost every day. Some pods are just worse.
Dj · 7mo ago
I have the ability to delist specific problematic servers while we investigate them. If you're having specific issues with any one machine or pods deployed to a specific machine (say you only use one DC) let me know.