Runpod•4w ago
ipeterov

How to deal with initialization errors?

I went to sleep and woke up to logs of multiple users trying out image generation, only for 100% of their requests to fail. After a brief investigation, I found a machine with this in its logs:
Traceback (most recent call last):
File "/comfyui/main.py", line 132, in <module>
import execution
File "/comfyui/execution.py", line 14, in <module>
import comfy.model_management
File "/comfyui/comfy/model_management.py", line 221, in <module>
total_vram = get_total_memory(get_torch_device()) / (1024 * 1024)
File "/comfyui/comfy/model_management.py", line 172, in get_torch_device
return torch.device(torch.cuda.current_device())
File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 1026, in current_device
_lazy_init()
File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 372, in _lazy_init
torch._C._cuda_init()
RuntimeError: CUDA driver initialization failed, you might not have a CUDA gpu.
--- Starting Serverless Worker | Version 1.7.13 ---
{"requestId": null, "message": "Jobs in queue: 1", "level": "INFO"}
{"requestId": null, "message": "Jobs in progress: 1", "level": "INFO"}
{"requestId": "bd072380-b893-41c6-8239-d1a738738efa-e2", "message": "Started.", "level": "INFO"}
runpod-worker-comfy - Failed to connect to server at http://127.0.0.1:8188 after 500 attempts.
{"requestId": "bd072380-b893-41c6-8239-d1a738738efa-e2", "message": "Finished.", "level": "INFO"}
{"requestId": null, "message": "Jobs in queue: 1", "level": "INFO"}
{"requestId": null, "message": "Jobs in progress: 1", "level": "INFO"}
{"requestId": "9ea27b3f-0170-4b5e-bd8c-8c202913dd61-e2", "message": "Started.", "level": "INFO"}
runpod-worker-comfy - Failed to connect to server at http://127.0.0.1:8188 after 500 attempts.
{"requestId": "9ea27b3f-0170-4b5e-bd8c-8c202913dd61-e2", "message": "Finished.", "level": "INFO"}
{"requestId": null, "message": "Jobs in queue: 1", "level": "INFO"}
{"requestId": null, "message": "Jobs in progress: 1", "level": "INFO"}
{"requestId": "33e38811-5b65-450b-8614-89f9cf703f5a-e1", "message": "Started.", "level": "INFO"}
runpod-worker-comfy - Failed to connect to server at http://127.0.0.1:8188 after 500 attempts.
{"requestId": "33e38811-5b65-450b-8614-89f9cf703f5a-e1", "message": "Finished.", "level": "INFO"}
I'm assuming the issue here is that the machine was misconfigured and it's not something in my code. So my question is: how can I avoid this in the future? Do I need to monitor these errors and kill the worker through the API? Can a worker shut itself down after it sees an error like this? Is there a healthcheck I can leverage?
5 Replies
yhlong00000
yhlong00000•4w ago
Technically, we should handle that better and detect it on our side. In the meantime, you might be able to implement some code on your end: if you get a CUDA error, terminate that worker. Feel free to open a support ticket and report the worker ID to us.
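A minimal sketch of that self-check, assuming a Python entrypoint that wraps the Runpod handler (in runpod-worker-comfy the start script launches ComfyUI separately, so the same check could run there instead). The handler import is a placeholder; torch.cuda.is_available() and runpod.serverless.start() are the only real APIs relied on, and exiting non-zero is assumed to be enough to keep a broken worker from picking up jobs:

# guard.py - hypothetical entrypoint that refuses to serve jobs without a working GPU
import os
import sys

import torch
import runpod


def cuda_is_usable() -> bool:
    # is_available() normally just returns False when the driver can't initialize;
    # the except is a safety net for code paths that raise instead
    # (like the current_device() call in the traceback above).
    try:
        return torch.cuda.is_available() and torch.cuda.device_count() > 0
    except RuntimeError:
        return False


if __name__ == "__main__":
    if not cuda_is_usable():
        worker_id = os.environ.get("RUNPOD_POD_ID", "unknown")  # set by Runpod on workers
        print(f"CUDA unusable on worker {worker_id}; exiting before taking jobs.", file=sys.stderr)
        sys.exit(1)

    from handler import handler  # placeholder for your existing job handler
    runpod.serverless.start({"handler": handler})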
Vectris
Vectris•3w ago
I'm also experiencing this error - any idea of the root cause?
PotapovS
PotapovS•3w ago
Hey šŸ– I'm also encountering this issue. Every 2–3 hours, a CUDA error occurs and the worker doesn't stop, continuing to burn money. From previous threads, I set the worker's CUDA version to 12.7, 12.8, and 12.9, but that didn’t help. Please, while you're investigating this problem, provide an example of code that we can add to our workers so that they immediately terminate and stop charging money for nothing, allowing the task to switch to another worker.
ipeterov
ipeterovOP•3w ago
@yhlong00000 I didn't record the worker ID the first time, but I got the error again. Here's the worker ID: m4q8mbq9j69ks9. Also, here's the relevant part of the logs:
Checkpoint files will always be loaded safely.
Traceback (most recent call last):
File "/comfyui/main.py", line 132, in <module>
import execution
File "/comfyui/execution.py", line 14, in <module>
import comfy.model_management
File "/comfyui/comfy/model_management.py", line 221, in <module>
total_vram = get_total_memory(get_torch_device()) / (1024 * 1024)
File "/comfyui/comfy/model_management.py", line 172, in get_torch_device
return torch.device(torch.cuda.current_device())
File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 1026, in current_device
_lazy_init()
File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 372, in _lazy_init
torch._C._cuda_init()
RuntimeError: CUDA driver initialization failed, you might not have a CUDA gpu.
--- Starting Serverless Worker | Version 1.7.13 ---
{"requestId": null, "message": "Jobs in queue: 1", "level": "INFO"}
{"requestId": null, "message": "Jobs in progress: 1", "level": "INFO"}
{"requestId": "863c0bef-396a-4479-99f7-d38359c7c796-e2", "message": "Started.", "level": "INFO"}
runpod-worker-comfy - Failed to connect to server at http://127.0.0.1:8188 after 500 attempts.
runpod-worker-comfy - image(s) upload
{"requestId": "863c0bef-396a-4479-99f7-d38359c7c796-e2", "message": "Captured Handler Exception", "level": "ERROR"}
fullcirclenetworks
fullcirclenetworks•2w ago
It's tough because we don't have access to log messages via the API, correct? I've been trying to automate debugging of failed endpoints, and all I get back is: "šŸ“„ Recent Log Sample: [2025-08-13T17:21:48.305511] INFO: Log retrieval not available via API"
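Correct, as far as I know logs aren't retrievable through the API, but serverless endpoints do expose a health endpoint, so one workaround is to poll that and alert (or stop sending traffic) when workers show up as unhealthy or jobs keep failing. A rough sketch; the exact field names in the JSON response are an assumption from memory, so verify them against a real response first:

import os
import requests

ENDPOINT_ID = "your-endpoint-id"          # placeholder
API_KEY = os.environ["RUNPOD_API_KEY"]    # your Runpod API key


def endpoint_health() -> dict:
    # GET https://api.runpod.ai/v2/<endpoint_id>/health returns worker and job counters.
    resp = requests.get(
        f"https://api.runpod.ai/v2/{ENDPOINT_ID}/health",
        headers={"Authorization": f"Bearer {API_KEY}"},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()


health = endpoint_health()
workers = health.get("workers", {})   # assumed keys: ready, idle, unhealthy, throttled, ...
jobs = health.get("jobs", {})         # assumed keys: inQueue, inProgress, failed, ...
if workers.get("unhealthy", 0) > 0 or jobs.get("failed", 0) > 0:
    print("Endpoint looks unhealthy:", health)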
