How to deal with initialization errors?
I went to sleep and woke up to logs showing multiple users trying out image generation, with 100% of requests failing.
After a brief investigation I found a machine with this in the logs:
I'm assuming the issue here is that the machine was misconfigured, and it's not something with my code.
So my question is: how can I avoid that in the future? Do I need to monitor those errors and kill the worker through the API? Can a worker shut itself down after it sees an error like that? Is there a healthcheck I can leverage?
Technically, we should handle this better and detect it on our side. In the meantime, you might be able to implement something on your end: if you get a CUDA error, terminate that worker. Feel free to open a support ticket and report the worker ID to us.
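A minimal sketch of that workaround, assuming a typical Python serverless handler. The marker list and the wrapper name are hypothetical; adjust the markers to whatever your logs actually show. Exiting the process with a non-zero status is what stops the worker (and the billing) so the job can be retried elsewhere:

```python
import sys

# Substrings that indicate the GPU is in an unrecoverable state.
# (Hypothetical list -- extend it with what your own logs show.)
FATAL_CUDA_MARKERS = (
    "CUDA error",
    "no CUDA-capable device",
    "CUDA driver initialization failed",
)

def is_fatal_cuda_error(exc: BaseException) -> bool:
    """Return True if the exception text looks like a fatal CUDA failure."""
    msg = str(exc)
    return any(marker in msg for marker in FATAL_CUDA_MARKERS)

def exit_on_cuda_error(handler):
    """Wrap a job handler so the container exits on a fatal CUDA error.

    A non-zero exit kills the worker instead of letting it keep accepting
    jobs (and burning money) on a broken GPU.
    """
    def wrapped(job):
        try:
            return handler(job)
        except Exception as exc:
            if is_fatal_cuda_error(exc):
                print(f"Fatal CUDA error, terminating worker: {exc}",
                      file=sys.stderr)
                sys.exit(1)
            raise  # non-CUDA errors stay normal job failures
    return wrapped

# Usage with the usual runpod serverless entrypoint (not executed here):
# import runpod
# runpod.serverless.start({"handler": exit_on_cuda_error(handler)})
```

This only catches errors raised inside the handler; an error during model load at import time would need the same check around your startup code.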
I'm also experiencing this error - any idea of the root cause?
Hey
I'm also encountering this issue.
Every 2–3 hours, a CUDA error occurs and the worker doesn't stop, continuing to burn money.
From previous threads, I set the worker's CUDA version to 12.7, 12.8, and 12.9, but that didn't help.
Please, while you're investigating this problem, provide example code that we can add to our workers so that they terminate immediately, stop billing for nothing, and let the job fail over to another worker.
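Until there's an official fix, one option is a startup health check that refuses to accept jobs on a broken machine. A minimal sketch, assuming `nvidia-smi` is on the PATH inside the container (the function name and timeout are my own choices):

```python
import subprocess
import sys

def gpu_healthy(cmd=("nvidia-smi",), timeout=30):
    """Return True if the GPU query command exits cleanly.

    `nvidia-smi` failing (or hanging) at startup is a strong signal that
    the machine's driver/CUDA setup is broken and no job will succeed here.
    """
    try:
        result = subprocess.run(
            list(cmd),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
        return result.returncode == 0
    except (OSError, subprocess.TimeoutExpired):
        return False

if __name__ == "__main__":
    # Run this before starting the handler: a non-zero exit stops the
    # worker before it accepts (and bills for) any jobs.
    if not gpu_healthy():
        print("GPU health check failed, refusing to start", file=sys.stderr)
        sys.exit(1)
```

This catches machines that are broken from the start; for the errors that appear hours in, you'd still want the handler itself to exit when it hits a fatal CUDA error.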
@yhlong00000 I didn't record the worker ID the first time, but I got the error again. Here's the worker ID:
m4q8mbq9j69ks9
Also, relevant part of the logs:
It's tough because we don't have access to log messages via the API, correct? I have been trying to automate the debugging of failed endpoints and I get an error "Recent Log Sample:
[2025-08-13T17:21:48.305511] INFO: Log retrieval not available via API"