Unhealthy workers keep sabotaging production
As you can see, somehow 2 of my 3 active workers plus all flexible workers became unhealthy. I don't know the reason for this or whether I have any power to fix it. However, without my involvement RunPod doesn't kill those workers and doesn't automatically replace them with healthy ones, which makes my prod unstable. To resolve this incident I had to kill the unhealthy workers manually. I need some support on how to prevent or handle this situation.

You can automate killing unhealthy workers via the API. It's just another thing the user has to run a server for while using a serverless platform.
I would also try to check and fix the cause of the unhealthy states first. If you provide more info about what you're running, plus logs of those errors, others might be able to help you.
I wish I knew how to find out the reason for those unhealthy workers. There are no proper logs for those workers anywhere; that's all I see.

Ah, so the workers are unhealthy even without running any requests? In that case, this issue should definitely be escalated to a ticket.
Yes, I don't think I have the control there.
Also, from Grok research:
So, it seems there's no API for that. At least, Grok didn't find one.
GET /endpoints/id lists the workers in the selected endpoint. DELETE /pods/id removes the selected worker from the endpoint. There's also a GraphQL API you can use.
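A rough sketch of what automating that could look like (untested; the base URL, auth header, and the workers/status field names are assumptions on my side, so check the API reference before relying on it):

```python
import os
import requests

# Assumptions: REST base URL, bearer-token auth, and that the endpoint
# response contains a "workers" list with "id" and "status" fields.
# Verify all of these against the current API docs.
API_KEY = os.environ["RUNPOD_API_KEY"]
ENDPOINT_ID = os.environ["RUNPOD_ENDPOINT_ID"]
BASE_URL = "https://rest.runpod.io/v1"
HEADERS = {"Authorization": f"Bearer {API_KEY}"}

def kill_unhealthy_workers() -> None:
    # GET /endpoints/{id} lists the workers attached to the endpoint.
    resp = requests.get(f"{BASE_URL}/endpoints/{ENDPOINT_ID}", headers=HEADERS, timeout=30)
    resp.raise_for_status()
    workers = resp.json().get("workers", [])

    for worker in workers:
        if worker.get("status") == "unhealthy":  # field name is a guess
            # DELETE /pods/{id} removes the worker so a fresh one can spin up.
            r = requests.delete(f"{BASE_URL}/pods/{worker['id']}", headers=HEADERS, timeout=30)
            r.raise_for_status()
            print(f"Terminated unhealthy worker {worker['id']}")

if __name__ == "__main__":
    kill_unhealthy_workers()
```
You'd have to run that on a schedule (cron or similar) outside the endpoint itself.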
I don't use pods, I use Serverless. Or does this work for Serverless?
Serverless workers are essentially pods in this context. You can terminate the worker with that API endpoint just fine. Also, before we call for support, can you check the Logs tab in your endpoint for errors from the date and time the unhealthy workers were reported? Maybe something got logged there.
I found this, but I'm not sure how it's related.
Besides, I also found this:
When I search for error logs specifically, I see plenty of these:
So, it seems that with my latest build I probably updated the runpod SDK version and something went wrong. At least it seems there's better logging now.
Are you using RunPod's official ComfyUI worker template?
Yes :) But I've been using it for a year already and haven't changed anything in it. I use this: https://github.com/runpod-workers/cog-worker
Yeah, I need to update my version! But I was using a locked version 1.2.0 of the runpod SDK for building the image.
@Jason would you be so kind as to escalate this, please? Maybe they'll see more in their logs or pass this to the comfy template maintainer.
I also see this

@Genia
Escalated To Zendesk
The thread has been escalated to Zendesk!
Ticket ID: #20362
Let me know how it goes with support, happy to help as well.
Thank you:) You guys here are so nice 🙂
So, support told me to implement the functionality to kill those unhealthy workers via the API, and that's it.
So, if I make any new discoveries about how to fix this, I'll share them in the thread.
https://docs.runpod.io/serverless/workers/overview#worker-states
Here’s the doc on unhealthy workers. Usually it means something is wrong with your image or code that’s causing the container to crash; it could also be running out of memory. I’d recommend checking the Logs tab for any clues or error messages that might help identify the issue.
Just killing unhealthy workers isn’t a good long-term solution. It’s important to identify and fix the root cause.

You stated you're using the official RunPod ComfyUI worker, but also that you convert a cog container and use a custom image tag. If you're using a custom handler and custom code, unfortunately no template or project maintainer can directly help debug the errors you've encountered; you'll have to do it yourself. Or, if you're willing to share the code, try to get help from the community.
Since you most likely don't watch each job manually while it's running to catch the problem in real time, I would personally add more logging to your code, so that when it happens next time you have more information to go on.
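For example, a minimal sketch of a serverless handler with extra logging; do_work and the input keys here are placeholders, not your actual code:

```python
import logging

import runpod

# Log to stdout so messages show up in the endpoint's Logs tab.
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
logger = logging.getLogger(__name__)

def do_work(job_input):
    # Placeholder for the real cog/ComfyUI processing.
    return {"echo": job_input}

def handler(job):
    job_input = job.get("input", {})
    logger.info("Job %s started, input keys: %s", job.get("id"), list(job_input.keys()))
    try:
        result = do_work(job_input)
        logger.info("Job %s finished", job.get("id"))
        return result
    except Exception:
        # Keep the full traceback so there is something to look at
        # the next time a worker turns unhealthy.
        logger.exception("Job %s failed", job.get("id"))
        raise

runpod.serverless.start({"handler": handler})
```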
