Runpod•4mo ago
Genia

Unhealthy workers keep sabotaging production

As you can see, somehow 2 of 3 active workers plus all of the flex workers became unhealthy. I don't know the reason for this, or whether I have any power to fix it. However, without my involvement, Runpod doesn't kill those workers and doesn't automatically replace them with healthy ones, which makes my production unstable. To resolve this incident I had to kill the unhealthy workers manually. I need some support on how to prevent or handle this situation.
(image attachment)
17 Replies
3WaD•4mo ago
You can automate killing unhealthy workers via the API. It's just another thing the user has to run a server for while using a serverless platform, though. I would also try to find and fix the cause of the unhealthy state first. Provide more info about what you're running, plus the logs of those errors, and others might be able to help you.
GeniaOP•4mo ago
I wish I knew how to find out the reason for those unhealthy workers. There are no proper logs for those workers anywhere; that's all I see.
(image attachment)
3WaD•4mo ago
Ah, so the workers are unhealthy even without running any requests? In that case, this issue should definitely be escalated to a ticket.
GeniaOP•4mo ago
Yes, I don't think I have any control there. Also, from Grok research:
The Runpod documentation, as reviewed, does not provide an API call for automatically killing unhealthy Serverless workers. The system automatically retries them with exponential backoff for up to 7 days, and users can monitor health using the /health endpoint but cannot terminate individual workers via API.
So it seems there's no API for that. At least, Grok didn't find one.
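For monitoring, at least, the /health route mentioned above can be polled per endpoint. A minimal sketch, assuming the URL https://api.runpod.ai/v2/{endpoint_id}/health and an API key in the RUNPOD_API_KEY environment variable; the exact response shape is an assumption, so check it against the docs:
```python
import os
import requests

ENDPOINT_ID = "your-endpoint-id"  # hypothetical placeholder; use your own endpoint ID
HEALTH_URL = f"https://api.runpod.ai/v2/{ENDPOINT_ID}/health"

resp = requests.get(
    HEALTH_URL,
    headers={"Authorization": f"Bearer {os.environ['RUNPOD_API_KEY']}"},
    timeout=30,
)
resp.raise_for_status()

# Assumed shape: counts of workers and jobs, e.g.
# {"workers": {"idle": 1, "running": 2, ...}, "jobs": {...}}
print(resp.json())
```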
3WaD•4mo ago
GET /endpoints/{id} lists the workers in the selected endpoint, and DELETE /pods/{id} removes the selected worker from the endpoint. There's also a GraphQL API you can use.
Runpod Documentation: Find an endpoint by ID (returns a single endpoint)
Runpod Documentation: Delete a Pod
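A minimal sketch of automating the cleanup with those two calls, assuming the REST base URL https://rest.runpod.io/v1 and that the endpoint response carries a workers list with id and state fields (the field names and the "unhealthy" value are assumptions; verify them against the API reference):
```python
import os
import requests

API_BASE = "https://rest.runpod.io/v1"  # assumed REST API base URL
HEADERS = {"Authorization": f"Bearer {os.environ['RUNPOD_API_KEY']}"}

def kill_unhealthy_workers(endpoint_id: str) -> None:
    # GET /endpoints/{id}: per the message above, this lists the endpoint's workers.
    resp = requests.get(f"{API_BASE}/endpoints/{endpoint_id}",
                        headers=HEADERS, timeout=30)
    resp.raise_for_status()

    # "workers", "state", and the "unhealthy" value are assumed names;
    # check the actual response shape in the API reference.
    for worker in resp.json().get("workers", []):
        if worker.get("state") == "unhealthy":
            # DELETE /pods/{id}: serverless workers are pods in this context.
            r = requests.delete(f"{API_BASE}/pods/{worker['id']}",
                                headers=HEADERS, timeout=30)
            r.raise_for_status()
            print(f"removed unhealthy worker {worker['id']}")

if __name__ == "__main__":
    kill_unhealthy_workers("your-endpoint-id")  # hypothetical endpoint ID
```
Run it on a schedule (cron or a small cloud function) if you want this fully unattended.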
GeniaOP•4mo ago
I don't use pods, I use Serverless. Or does this work for Serverless?
3WaD•4mo ago
Serverless workers are essentially pods in this context. You can terminate the worker with that API endpoint just fine. Also, before we call for support, can you check the Logs tab in your endpoint for errors from the date and time the unhealthy workers were reported? Maybe something was logged there.
GeniaOP•4mo ago
I found this, but I'm not sure how it's related:
Not running in Kubernetes: disabling probe helpers.
When I search for error logs specifically, I see plenty of these:
[error]ERROR | Error while getting job: \n
r9jt4vcy6c3izo[error]worker exited with exit code 1
6j36k9wqb3goqc[error]worker exited with exit code 1
cmwatzhb7kon8g[error]worker exited with exit code 1
g7mbqez0ikifyx[error]ERROR | Error while getting job: \n
[error]ERROR | Error while running job sync-11221b11-675f-41d4-864b-2d00274117d4-u2: argument of type 'NoneType' is not iterable\n
ERROR | Error while getting job: 502, message='Attempt to decode JSON with unexpected mimetype: text/html', url='https://api.runpod.ai/v2/dn8tskucwc6j2t/job-take/iwi7t2pu07zv6e?gpu=NVIDIA+GeForce+RTX+4090'\n
So it seems that with my latest build I probably updated the runpod SDK version and something went off. At least there seems to be better logging now.
3WaD•4mo ago
Are you using RunPod's official ComfyUI worker template?
GeniaOP•4mo ago
Yes :) But I've been using it for a year already and haven't changed anything in it. I use this: https://github.com/runpod-workers/cog-worker Yeah, I need to update my version! But I was building the image with the runpod version locked to 1.2.0.
GitHub: runpod-workers/cog-worker - Cog based workers to RunPod serverless workers.
3WaD•4mo ago
@Jason would you be so kind as to escalate this, please? Maybe they'll see more in their logs or pass this to the comfy template maintainer.
GeniaOP•4mo ago
I also see this
(image attachment)
Poddy•4mo ago
@Genia
Escalated To Zendesk
The thread has been escalated to Zendesk!
Ticket ID: #20362
Dj•4mo ago
Let me know how it goes with support, happy to help as well.
GeniaOP•4mo ago
Thank you :) You guys here are so nice 🙂 Support told me to implement the functionality to kill those unhealthy workers via the API, and that's it. If I have any new discoveries about how I fixed this, I'll share them in this thread.
yhlong00000•4mo ago
https://docs.runpod.io/serverless/workers/overview#worker-states Here's the doc on unhealthy workers. Usually it means something is wrong with your image or code that's causing the container to crash; it could also be running out of memory. I'd recommend checking the Logs tab for any clues or error messages that might help identify the issue. Just killing unhealthy workers isn't a good long-term solution; it's important to identify and fix the root cause.
(image attachment)
3WaD•4mo ago
You stated you're using the official RunPod ComfyUI worker, but then also that you convert a Cog container and use a custom image. If you're using a custom handler and code, unfortunately no template or project maintainer can directly help debug the errors you've encountered, and you'll have to do it yourself, or, if you're willing to share the code, try to get help from the community. Since you most likely don't watch each job manually while it's running to catch the problem in real time, I would personally add more logging to your code, so when it happens next time you have more information about it. Something like the sketch below could be a starting point.
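A minimal sketch of that kind of logging with the RunPod Python SDK's handler pattern (do_work is a hypothetical stand-in for the actual pipeline, and the None-guard addresses one plausible source of the "argument of type 'NoneType' is not iterable" error above):
```python
import logging
import traceback

import runpod  # RunPod Python SDK

logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("worker")

def do_work(job_input):
    # hypothetical stand-in for the actual inference code
    return {"echo": job_input}

def handler(job):
    # Guard against a missing/None "input", which would later raise
    # "argument of type 'NoneType' is not iterable" in membership checks.
    job_input = job.get("input") or {}
    log.info("job %s started, input keys: %s", job.get("id"), list(job_input))
    try:
        result = do_work(job_input)
        log.info("job %s finished", job.get("id"))
        return result
    except Exception:
        # Log the full traceback so the endpoint's Logs tab shows more than
        # "worker exited with exit code 1".
        log.error("job %s failed:\n%s", job.get("id"), traceback.format_exc())
        raise

runpod.serverless.start({"handler": handler})
```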
