Runpod•4w ago
ipeterov

How to deal with initialization errors?

I went to sleep and woke up to logs of multiple users trying out image generation, only for 100% of their requests to fail. After a brief investigation, I found a machine with this in its logs:
Traceback (most recent call last):
File "/comfyui/main.py", line 132, in <module>
import execution
File "/comfyui/execution.py", line 14, in <module>
import comfy.model_management
File "/comfyui/comfy/model_management.py", line 221, in <module>
total_vram = get_total_memory(get_torch_device()) / (1024 * 1024)
File "/comfyui/comfy/model_management.py", line 172, in get_torch_device
return torch.device(torch.cuda.current_device())
File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 1026, in current_device
_lazy_init()
File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 372, in _lazy_init
torch._C._cuda_init()
RuntimeError: CUDA driver initialization failed, you might not have a CUDA gpu.
--- Starting Serverless Worker | Version 1.7.13 ---
{"requestId": null, "message": "Jobs in queue: 1", "level": "INFO"}
{"requestId": null, "message": "Jobs in progress: 1", "level": "INFO"}
{"requestId": "bd072380-b893-41c6-8239-d1a738738efa-e2", "message": "Started.", "level": "INFO"}
runpod-worker-comfy - Failed to connect to server at http://127.0.0.1:8188 after 500 attempts.
{"requestId": "bd072380-b893-41c6-8239-d1a738738efa-e2", "message": "Finished.", "level": "INFO"}
{"requestId": null, "message": "Jobs in queue: 1", "level": "INFO"}
{"requestId": null, "message": "Jobs in progress: 1", "level": "INFO"}
{"requestId": "9ea27b3f-0170-4b5e-bd8c-8c202913dd61-e2", "message": "Started.", "level": "INFO"}
runpod-worker-comfy - Failed to connect to server at http://127.0.0.1:8188 after 500 attempts.
{"requestId": "9ea27b3f-0170-4b5e-bd8c-8c202913dd61-e2", "message": "Finished.", "level": "INFO"}
{"requestId": null, "message": "Jobs in queue: 1", "level": "INFO"}
{"requestId": null, "message": "Jobs in progress: 1", "level": "INFO"}
{"requestId": "33e38811-5b65-450b-8614-89f9cf703f5a-e1", "message": "Started.", "level": "INFO"}
runpod-worker-comfy - Failed to connect to server at http://127.0.0.1:8188 after 500 attempts.
{"requestId": "33e38811-5b65-450b-8614-89f9cf703f5a-e1", "message": "Finished.", "level": "INFO"}
I'm assuming the issue here is that the machine was misconfigured and it's not something in my code. So my question is: how can I avoid this in the future? Do I need to monitor these errors and kill the worker through the API? Can a worker shut itself down after it sees an error like this? Is there a healthcheck I can leverage?
5 Replies
yhlong00000
yhlong00000•4w ago
Technically, we should handle that better and detect it on our side. In the meantime, you might be able to implement some code on your end: if you get a CUDA error, terminate that worker. Feel free to open a support ticket and report the worker ID to us.
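A minimal sketch of that self-check, assuming a Python entrypoint that wraps the Runpod handler (in runpod-worker-comfy the start script launches ComfyUI separately, so the same check could run there instead). The handler import is a placeholder; torch.cuda.is_available() and runpod.serverless.start() are the only real APIs relied on, and exiting non-zero is assumed to be enough to keep a broken worker from picking up jobs:

# guard.py - hypothetical entrypoint that refuses to serve jobs without a working GPU
import os
import sys

import torch
import runpod


def cuda_is_usable() -> bool:
    # is_available() normally just returns False when the driver can't initialize;
    # the except is a safety net for code paths that raise instead
    # (like the current_device() call in the traceback above).
    try:
        return torch.cuda.is_available() and torch.cuda.device_count() > 0
    except RuntimeError:
        return False


if __name__ == "__main__":
    if not cuda_is_usable():
        worker_id = os.environ.get("RUNPOD_POD_ID", "unknown")  # set by Runpod on workers
        print(f"CUDA unusable on worker {worker_id}; exiting before taking jobs.", file=sys.stderr)
        sys.exit(1)

    from handler import handler  # placeholder for your existing job handler
    runpod.serverless.start({"handler": handler})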
Vectris
Vectris•3w ago
I'm also experiencing this error - any idea of the root cause?
PotapovS
PotapovS•3w ago
Hey šŸ– I'm also encountering this issue. Every 2–3 hours, a CUDA error occurs and the worker doesn't stop, continuing to burn money. From previous threads, I set the worker's CUDA version to 12.7, 12.8, and 12.9, but that didn’t help. Please, while you're investigating this problem, provide an example of code that we can add to our workers so that they immediately terminate and stop charging money for nothing, allowing the task to switch to another worker.
ipeterov
ipeterovOP•3w ago
@yhlong00000 I didn't record the worker ID the first time, but I got the error again. Here's the worker ID: m4q8mbq9j69ks9. Also, here's the relevant part of the logs:
Checkpoint files will always be loaded safely.
Traceback (most recent call last):
File "/comfyui/main.py", line 132, in <module>
import execution
File "/comfyui/execution.py", line 14, in <module>
import comfy.model_management
File "/comfyui/comfy/model_management.py", line 221, in <module>
total_vram = get_total_memory(get_torch_device()) / (1024 * 1024)
File "/comfyui/comfy/model_management.py", line 172, in get_torch_device
return torch.device(torch.cuda.current_device())
File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 1026, in current_device
_lazy_init()
File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 372, in _lazy_init
torch._C._cuda_init()
RuntimeError: CUDA driver initialization failed, you might not have a CUDA gpu.
--- Starting Serverless Worker | Version 1.7.13 ---
{"requestId": null, "message": "Jobs in queue: 1", "level": "INFO"}
{"requestId": null, "message": "Jobs in progress: 1", "level": "INFO"}
{"requestId": "863c0bef-396a-4479-99f7-d38359c7c796-e2", "message": "Started.", "level": "INFO"}
runpod-worker-comfy - Failed to connect to server at http://127.0.0.1:8188 after 500 attempts.
runpod-worker-comfy - image(s) upload
{"requestId": "863c0bef-396a-4479-99f7-d38359c7c796-e2", "message": "Captured Handler Exception", "level": "ERROR"}
fullcirclenetworks
fullcirclenetworks•2w ago
It's tough because we don't have access to log messages via the API, correct? I've been trying to automate debugging of failed endpoints, and all I get back is: "šŸ“„ Recent Log Sample: [2025-08-13T17:21:48.305511] INFO: Log retrieval not available via API"
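Correct, as far as I know logs aren't retrievable through the API, but serverless endpoints do expose a health endpoint, so one workaround is to poll that and alert (or stop sending traffic) when workers show up as unhealthy or jobs keep failing. A rough sketch; the exact field names in the JSON response are an assumption from memory, so verify them against a real response first:

import os
import requests

ENDPOINT_ID = "your-endpoint-id"          # placeholder
API_KEY = os.environ["RUNPOD_API_KEY"]    # your Runpod API key


def endpoint_health() -> dict:
    # GET https://api.runpod.ai/v2/<endpoint_id>/health returns worker and job counters.
    resp = requests.get(
        f"https://api.runpod.ai/v2/{ENDPOINT_ID}/health",
        headers={"Authorization": f"Bearer {API_KEY}"},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()


health = endpoint_health()
workers = health.get("workers", {})   # assumed keys: ready, idle, unhealthy, throttled, ...
jobs = health.get("jobs", {})         # assumed keys: inQueue, inProgress, failed, ...
if workers.get("unhealthy", 0) > 0 or jobs.get("failed", 0) > 0:
    print("Endpoint looks unhealthy:", health)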
