ComfyUI Serverless Worker CUDA Errors

Some serverless workers run into runtime CUDA errors and fail silently. Is there any way to tackle this? Can I somehow get RunPod to fire a webhook so I can at least retry? Any solutions to make serverless more predictable? How are people deploying production-level ComfyUI inference on serverless? Am I doing something wrong?
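For the retry/webhook part of the question, here is a minimal client-side sketch. It assumes the standard RunPod serverless REST endpoints (`/run` and `/status/{id}`), that the request body accepts a top-level `webhook` field as documented, and that your worker expects the workflow under `input.workflow` (adjust to your worker's schema). It resubmits the job whenever a worker dies with a failed status instead of letting the request vanish:

```python
import time
import requests

API_KEY = "YOUR_RUNPOD_API_KEY"      # placeholder
ENDPOINT_ID = "YOUR_ENDPOINT_ID"     # placeholder
BASE = f"https://api.runpod.ai/v2/{ENDPOINT_ID}"
HEADERS = {"Authorization": f"Bearer {API_KEY}"}


def submit(workflow: dict, webhook_url: str | None = None) -> str:
    """Queue a job; if webhook_url is set, RunPod POSTs the final result there."""
    payload = {"input": {"workflow": workflow}}
    if webhook_url:
        payload["webhook"] = webhook_url
    r = requests.post(f"{BASE}/run", json=payload, headers=HEADERS, timeout=30)
    r.raise_for_status()
    return r.json()["id"]


def run_with_retry(workflow: dict, max_attempts: int = 3, poll_s: float = 5.0) -> dict:
    """Poll the job and resubmit it if the worker dies (e.g. CUDA init failure)."""
    for attempt in range(1, max_attempts + 1):
        job_id = submit(workflow)
        while True:
            s = requests.get(f"{BASE}/status/{job_id}", headers=HEADERS, timeout=30).json()
            status = s.get("status")
            if status == "COMPLETED":
                return s.get("output", {})
            if status in ("FAILED", "CANCELLED", "TIMED_OUT"):
                print(f"attempt {attempt}/{max_attempts} ended with {status}: {s.get('error')}")
                break  # fall through and resubmit
            time.sleep(poll_s)
    raise RuntimeError(f"job still failing after {max_attempts} attempts")
```

The same retry logic works if you rely purely on the webhook instead of polling: treat any callback whose status is not COMPLETED as a signal to resubmit.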
12 Replies
Snow ❄ (2mo ago)
can you send the error? I might have the same thing
MassterOogway (OP, 2mo ago)
```
Traceback (most recent call last):
  File "/comfyui/main.py", line 132, in <module>
    import execution
  File "/comfyui/execution.py", line 14, in <module>
    import comfy.model_management
  File "/comfyui/comfy/model_management.py", line 221, in <module>
    total_vram = get_total_memory(get_torch_device()) / (1024 * 1024)
                                  ^^^^^^^^^^^^^^^^^^
  File "/comfyui/comfy/model_management.py", line 172, in get_torch_device
    return torch.device(torch.cuda.current_device())
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/torch/cuda/__init__.py", line 1071, in current_device
    _lazy_init()
  File "/opt/venv/lib/python3.12/site-packages/torch/cuda/__init__.py", line 412, in _lazy_init
    torch._C._cuda_init()
RuntimeError: CUDA driver initialization failed, you might not have a CUDA gpu.
```
Happens like 25 seconds into execution, and there is no pattern to it. It happens randomly to a worker. And the worst part is that when there are a bunch of requests lined up, it eats up all of them and silently fails every one...
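If you control the handler image, one way to stop a broken worker from silently draining the queue is to probe CUDA inside the handler before touching ComfyUI. This is only a sketch, assuming a custom worker built on the `runpod` Python SDK; `run_comfy_workflow` is a hypothetical stand-in for however your image actually drives ComfyUI, and the `refresh_worker` return key should be double-checked against the current RunPod serverless docs for your SDK version:

```python
import runpod
import torch


def cuda_ok() -> bool:
    """Return False if the CUDA driver cannot be initialized on this worker."""
    try:
        if not torch.cuda.is_available():
            return False
        torch.cuda.current_device()  # the same call that raises in the traceback above
        return True
    except RuntimeError:
        return False


def run_comfy_workflow(job_input: dict) -> dict:
    # Hypothetical stand-in for the real ComfyUI invocation
    # (e.g. POSTing the workflow to the local ComfyUI server).
    raise NotImplementedError


def handler(job):
    if not cuda_ok():
        # Fail this job loudly and ask RunPod to recycle the worker,
        # instead of letting it keep pulling and failing queued requests.
        return {
            "error": "CUDA driver initialization failed on this worker",
            "refresh_worker": True,  # verify the exact key name in the RunPod docs
        }
    return {"output": run_comfy_workflow(job["input"])}


runpod.serverless.start({"handler": handler})
```

Failed jobs then come back with an explicit error (and hit the webhook if you set one), so client-side retries can kick in instead of the queue being eaten silently.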
Unknown User (2mo ago)
(message not public)
Poddy (2mo ago)
@MassterOogway
Escalated To Zendesk
The thread has been escalated to Zendesk!
Ticket ID: #22895
MassterOogway (OP, 2mo ago)
My workflow performs extremely well on the 5090 vs the 4090, I cannot lose that efficiency, man.
Like from my account?
Snow ❄ (2mo ago)
I prefer it too, it just doesn't work atm and I need it
Unknown User (2mo ago)
(message not public)
Xeverian (2mo ago)
Started happening to me too lately. I use a CUDA 12.6 image and a 4090.
billchen (2mo ago)
I'm having the same issue. Could you please tell me if there's a way to fix it?
Xeverian (2mo ago)
Use CUDA 12.6 to 12.8; 12.9 seems very unstable.
MassterOogway (OP, 2mo ago)
I talked to RunPod support, and they told me to use 12.8. Haven't had any problems ever since; 5090 and 4090 are both fine.
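If you want to confirm what a given image/GPU combination actually gives you, a quick check inside the container (plain PyTorch calls, nothing RunPod-specific) prints the CUDA version the wheel was built against and whether the driver initializes; as far as I know the 5090 (Blackwell) needs a PyTorch build targeting CUDA 12.8 or newer, which matches the advice above:

```python
import torch

# Run inside the worker container to see what you actually have.
print("torch version      :", torch.__version__)
print("built against CUDA :", torch.version.cuda)          # e.g. "12.8"
print("cuda available     :", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device             :", torch.cuda.get_device_name(0))
    print("compute capability :", torch.cuda.get_device_capability(0))
```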