Unhealthy workers keep sabotaging production
As you can see, somehow 2 of my 3 active workers plus all flexible workers became unhealthy. I don't know the reason for this or whether I have any power to fix it. However, without my involvement RunPod doesn't kill those workers and doesn't automatically replace them with healthy ones, which makes my prod unstable. To resolve this incident I had to kill the unhealthy workers manually. I need some support on how to prevent or handle this situation.

You can automate killing unhealthy workers via the API. It's just another thing the user has to run a server for while using a serverless platform.
I would also try to check and fix the cause of the unhealthy states first. If you provide more info about what you're running, plus logs of those errors, others might be able to help you.
I wish I knew how to find out the reason for those unhealthy workers. There are no proper logs for those workers anywhere; that's all I see.

Ah, so the workers are unhealthy even without running any requests? In that case, this issue should definitely be escalated to a ticket.
Yes, I don't think I have the control there.
Also, from Grok research:
So, it seems there's no API for that. At least, Grok didn't find one.
GET /endpoints/id lists the workers in the selected endpoint. DELETE /pods/id removes the selected worker from the endpoint. There's also a GraphQL API you can use.
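A rough sketch of what automating that could look like (untested; the base URL, auth header, and the workers/status field names are assumptions on my side, so check the API reference before relying on it):

```python
import os
import requests

# Assumptions: REST base URL, bearer-token auth, and that the endpoint
# response contains a "workers" list with "id" and "status" fields.
# Verify all of these against the current API docs.
API_KEY = os.environ["RUNPOD_API_KEY"]
ENDPOINT_ID = os.environ["RUNPOD_ENDPOINT_ID"]
BASE_URL = "https://rest.runpod.io/v1"
HEADERS = {"Authorization": f"Bearer {API_KEY}"}

def kill_unhealthy_workers() -> None:
    # GET /endpoints/{id} lists the workers attached to the endpoint.
    resp = requests.get(f"{BASE_URL}/endpoints/{ENDPOINT_ID}", headers=HEADERS, timeout=30)
    resp.raise_for_status()
    workers = resp.json().get("workers", [])

    for worker in workers:
        if worker.get("status") == "unhealthy":  # field name is a guess
            # DELETE /pods/{id} removes the worker so a fresh one can spin up.
            r = requests.delete(f"{BASE_URL}/pods/{worker['id']}", headers=HEADERS, timeout=30)
            r.raise_for_status()
            print(f"Terminated unhealthy worker {worker['id']}")

if __name__ == "__main__":
    kill_unhealthy_workers()
```
You'd have to run that on a schedule (cron or similar) outside the endpoint itself.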
I don't use pods, I use Serverless. Or does this work for Serverless?
Serverless workers are essentially pods in this context. You can terminate the worker with that API endpoint just fine. Also, before we call for support, can you check the Logs tab in your endpoint for errors from the date and time the unhealthy workers were reported? Maybe something got logged there.
I found this, but I'm not sure how it's related.
Besides, I also found this:
When I search for error logs specifically, I see plenty of these:
So, it seems that with my latest build I probably updated the runpod SDK version and something went wrong. At least it seems there's better logging now.
Are you using RunPod's official ComfyUI worker template?
Yes :) But I've been using it for a year already and haven't changed anything in it. I use this: https://github.com/runpod-workers/cog-worker
Yeah, I need to update my version! But I was using a locked version 1.2.0 of the runpod SDK for building the image.
@Jason would you be so kind as to escalate this, please? Maybe they'll see more in their logs or pass this to the comfy template maintainer.
I also see this

@Genia
Escalated To Zendesk
The thread has been escalated to Zendesk!
Ticket ID: #20362
Let me know how it goes with support, happy to help as well.
Thank you:) You guys here are so nice 🙂
So, support told me to implement the functionality to kill those unhealthy workers via the API, and that's it.
So, if I make any new discoveries about how to fix this, I'll share them in the thread.
https://docs.runpod.io/serverless/workers/overview#worker-states
Here’s the doc on unhealthy workers. Usually it means something is wrong with your image or code that’s causing the container to crash; it could also be running out of memory. I’d recommend checking the Logs tab for any clues or error messages that might help identify the issue.
Just killing unhealthy workers isn’t a good long-term solution. It’s important to identify and fix the root cause.

You stated you're using the official RunPod ComfyUI worker, but also that you convert a cog container and use a custom image tag. If you're using a custom handler and custom code, unfortunately no template or project maintainer can directly help debug the errors you've encountered; you'll have to do it yourself. Or, if you're willing to share the code, try to get help from the community.
Since you most likely don't watch each job manually while it's running to catch the problem in real time, I would personally add more logging to your code, so that when it happens next time you have more information to go on.
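For example, a minimal sketch of a serverless handler with extra logging; do_work and the input keys here are placeholders, not your actual code:

```python
import logging

import runpod

# Log to stdout so messages show up in the endpoint's Logs tab.
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
logger = logging.getLogger(__name__)

def do_work(job_input):
    # Placeholder for the real cog/ComfyUI processing.
    return {"echo": job_input}

def handler(job):
    job_input = job.get("input", {})
    logger.info("Job %s started, input keys: %s", job.get("id"), list(job_input.keys()))
    try:
        result = do_work(job_input)
        logger.info("Job %s finished", job.get("id"))
        return result
    except Exception:
        # Keep the full traceback so there is something to look at
        # the next time a worker turns unhealthy.
        logger.exception("Job %s failed", job.get("id"))
        raise

runpod.serverless.start({"handler": handler})
```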
