ComfyUI Serverless Worker CUDA Errors
Some serverless workers run into runtime CUDA errors and fail silently. Is there any way to tackle this? Can I somehow get RunPod to fire a webhook so I can at least retry? Any solutions to make serverless more predictable?
How are people deploying production-level ComfyUI inference on serverless? Am I doing something wrong?
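If you're calling the endpoint through the async /run API, you can attach a webhook URL to the job and resubmit whenever the callback reports a failure. Rough sketch in Python (endpoint ID, API key, callback URL, and the input shape are placeholders; check the current RunPod docs for the exact webhook payload):

import requests

ENDPOINT_ID = "your-endpoint-id"   # placeholder
API_KEY = "your-runpod-api-key"    # placeholder

def submit_job(workflow: dict) -> str:
    # Queue a job asynchronously and ask RunPod to POST the final status
    # (COMPLETED/FAILED) to the webhook, so a failure can trigger a retry
    # instead of disappearing silently.
    resp = requests.post(
        f"https://api.runpod.ai/v2/{ENDPOINT_ID}/run",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={
            "input": {"workflow": workflow},
            "webhook": "https://your-server.example.com/runpod-callback",  # placeholder
        },
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["id"]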
12 Replies
can you send the error?
I might have the same thing
Traceback (most recent call last):
File "/comfyui/main.py", line 132, in <module>
import execution
File "/comfyui/execution.py", line 14, in <module>
import comfy.model_management
File "/comfyui/comfy/model_management.py", line 221, in <module>
total_vram = get_total_memory(get_torch_device()) / (1024 * 1024)
^^^^^^^^^^^^^^^^^^
File "/comfyui/comfy/model_management.py", line 172, in get_torch_device
return torch.device(torch.cuda.current_device())
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/venv/lib/python3.12/site-packages/torch/cuda/init.py", line 1071, in current_device
_lazy_init()
File "/opt/venv/lib/python3.12/site-packages/torch/cuda/init.py", line 412, in _lazy_init
torch._C._cuda_init()
RuntimeError: CUDA driver initialization failed, you might not have a CUDA gpu.
Happens about 25 seconds into execution, and there's no pattern to it; it hits a worker at random. The worst part is that when a bunch of requests are queued up, the broken worker eats all of them and silently fails every one.
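On the worker side, a guard in the handler can turn the silent failure into an explicit one so a broken worker doesn't drain the queue. Rough sketch assuming the runpod Python SDK; run_comfy_workflow is a placeholder for your own inference call, and (as far as I understand the docs) returning an "error" key marks the job FAILED while "refresh_worker": True asks RunPod to recycle the worker:

import runpod
import torch

def run_comfy_workflow(job_input):
    # Placeholder for your actual call into ComfyUI.
    raise NotImplementedError

def handler(event):
    # Fail fast if CUDA never initialized on this worker, instead of letting
    # it silently eat every queued request.
    if not torch.cuda.is_available():
        return {"error": "CUDA not available on this worker", "refresh_worker": True}
    try:
        return {"output": run_comfy_workflow(event["input"])}
    except RuntimeError as exc:
        if "CUDA" in str(exc):
            # Mark the job failed and get this worker replaced.
            return {"error": f"CUDA runtime error: {exc}", "refresh_worker": True}
        raise

runpod.serverless.start({"handler": handler})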
@MassterOogway
Escalated To Zendesk
The thread has been escalated to Zendesk!
Ticket ID: #22895
switching to 4090 solved it for me
check here
https://discord.com/channels/912829806415085598/1412115196129841172/1412115196129841172
My workflow performs dramatically better on a 5090 than on a 4090; I can't afford to lose that efficiency, man.
Like from my account?
I prefer it too, it just doesn't work atm and I need it
Started happening to me too lately. I use a CUDA 12.6 image and a 4090.
I'm having the same issue. Could you please tell me if there's a way to fix it?
Use CUDA 12.6-12.8, as 12.9 seems very unstable.
I talked to RunPod support and they told me to use 12.8.
Haven't had any problems ever since.
5090 and 4090 are both fine.
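If you want to confirm the image actually boots with a 12.6-12.8 build (and catch a bad worker before it picks up jobs), a tiny startup check in the container entrypoint is cheap. Sketch using only standard torch introspection:

import torch

def log_cuda_stack():
    # Print the CUDA stack the worker actually booted with, so version
    # mismatches show up in the endpoint logs before any job is taken.
    print("torch version:", torch.__version__)
    print("torch built against CUDA:", torch.version.cuda)  # expect 12.6-12.8
    print("CUDA available:", torch.cuda.is_available())
    if torch.cuda.is_available():
        print("GPU count:", torch.cuda.device_count())
        print("GPU 0:", torch.cuda.get_device_name(0))

if __name__ == "__main__":
    log_cuda_stack()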