R
Runpod2mo ago
Xeverian

Workers stuck at "Running" indefinitely until removed by hand

It was all good for a month or so but lately (3-4 days) many of my workers started randomly being stuck at "Running" with errors like this: (this is worker qv6a8l4769xx3q). They keep running like this, generating infinite uptime and seemingly affecting the queue (noticed many jobs waiting for minutes despite having 5 idle workers ready)
create container davyjonesdev69/aigen:hybridv8
error creating container: container: create: container create: Post "http://%2Fvar%2Frun%2Fdocker.sock/v1.43/containers/create?name=qv6a8l4769xx3q-0": context deadline exceeded
create container davyjonesdev69/aigen:hybridv8
error creating container: container: create: container create: exit status 1
9/6/2025, 7:41:18 PM
create container davyjonesdev69/aigen:hybridv8
9/6/2025, 7:41:18 PM
error creating container: container: create: container create: exit status 1
9/6/2025, 7:41:37 PM
create container davyjonesdev69/aigen:hybridv8
error creating container: container: create: container create: exit status 1
create container davyjonesdev69/aigen:hybridv8
9/6/2025, 7:41:57 PM
error creating container: container: create: container create: exit status 1
9/6/2025, 7:42:16 PM
create container davyjonesdev69/aigen:hybridv8
9/6/2025, 7:42:16 PM
error creating container: container: create: container create: exit status 1
create container davyjonesdev69/aigen:hybridv8
error creating container: container: create: container create: exit status 1
create container davyjonesdev69/aigen:hybridv8
error creating container: container: create: container create: exit status 1
create container davyjonesdev69/aigen:hybridv8
error creating container: container: create: container create: exit status 1
create container davyjonesdev69/aigen:hybridv8
9/6/2025, 7:43:35 PM
error creating container: container: create: container create: exit status 1
create container davyjonesdev69/aigen:hybridv8
error creating container: container: create: container create: exit status 1
create container davyjonesdev69/aigen:hybridv8
error creating container: container: create: container create: exit status 1
create container davyjonesdev69/aigen:hybridv8
error creating container: container: create: container create: exit status 1
create container davyjonesdev69/aigen:hybridv8
error creating container: container: create: container create: exit status 1
create container davyjonesdev69/aigen:hybridv8
error creating container: container: create: container create: exit status 1
create container davyjonesdev69/aigen:hybridv8
error creating container: container: create: container create: exit status 1
create container davyjonesdev69/aigen:hybridv8
error creating container: container: create: container create: exit status 1
create container davyjonesdev69/aigen:hybridv8
error creating container: container: create: container create: Post "http://%2Fvar%2Frun%2Fdocker.sock/v1.43/containers/create?name=qv6a8l4769xx3q-0": context deadline exceeded
create container davyjonesdev69/aigen:hybridv8
error creating container: container: create: container create: exit status 1
9/6/2025, 7:41:18 PM
create container davyjonesdev69/aigen:hybridv8
9/6/2025, 7:41:18 PM
error creating container: container: create: container create: exit status 1
9/6/2025, 7:41:37 PM
create container davyjonesdev69/aigen:hybridv8
error creating container: container: create: container create: exit status 1
create container davyjonesdev69/aigen:hybridv8
9/6/2025, 7:41:57 PM
error creating container: container: create: container create: exit status 1
9/6/2025, 7:42:16 PM
create container davyjonesdev69/aigen:hybridv8
9/6/2025, 7:42:16 PM
error creating container: container: create: container create: exit status 1
create container davyjonesdev69/aigen:hybridv8
error creating container: container: create: container create: exit status 1
create container davyjonesdev69/aigen:hybridv8
error creating container: container: create: container create: exit status 1
create container davyjonesdev69/aigen:hybridv8
error creating container: container: create: container create: exit status 1
create container davyjonesdev69/aigen:hybridv8
9/6/2025, 7:43:35 PM
error creating container: container: create: container create: exit status 1
create container davyjonesdev69/aigen:hybridv8
error creating container: container: create: container create: exit status 1
create container davyjonesdev69/aigen:hybridv8
error creating container: container: create: container create: exit status 1
create container davyjonesdev69/aigen:hybridv8
error creating container: container: create: container create: exit status 1
create container davyjonesdev69/aigen:hybridv8
error creating container: container: create: container create: exit status 1
create container davyjonesdev69/aigen:hybridv8
error creating container: container: create: container create: exit status 1
create container davyjonesdev69/aigen:hybridv8
error creating container: container: create: container create: exit status 1
create container davyjonesdev69/aigen:hybridv8
error creating container: container: create: container create: exit status 1
No description
10 Replies
Xeverian
XeverianOP2mo ago
the image itself is nothing special, and it worked without problems before, no changes were made to the image (I even increased the required disk size just in case) after a while I've got this in the worker's log, but is still was shown as idle:
remove container
WARN: container is unhealthy: dead
remove container
WARN: container is unhealthy: dead
it is happening again: one request in queue and one running worker, doing nothing
Xeverian
XeverianOP2mo ago
No description
Xeverian
XeverianOP2mo ago
nothing in the log this time even
No description
Xeverian
XeverianOP2mo ago
It keeps happening to new workers with no apparent reason
No description
Xeverian
XeverianOP2mo ago
No description
Xeverian
XeverianOP2mo ago
Also getting many of those:
RuntimeError: CUDA driver initialization failed, you might not have a CUDA gpu.
RuntimeError: CUDA driver initialization failed, you might not have a CUDA gpu.
absolutely no changes made on my side before it all started to happening
Xeverian
XeverianOP2mo ago
do I pay for this as well?
No description
Dj
Dj2mo ago
Can you email help@Runpod.io? We will get all of this sorted and I'll make sure they refund you for everything you've spent on this.
Xeverian
XeverianOP2mo ago
just sent an e-mail with the same subject as the topic here. Please look into it, as it affects my production service serverely (ticket # 23140)
ipeterov
ipeterov2w ago
hi @Dj , I'm getting the same thing too with my serverless endpoint - It started happening more after I updated my base image to pytorch/pytorch:2.8.0-cuda12.9-cudnn9-runtime (to get compatibility with 5090s) - it's happening from time to time, and usually it works without issues - I do have a filter set to only allow CUDA 12.9 in my endpoint settings probably the worst thing for me is that it gets stuck in "Running" - the worker process doesn't see that comfy died and keeps waiting and waiting. I guess that's something I can look into myself but why does that even happen in the first place? is the 12.9 tag inaccurate on some machines? is there more nuance to this? I think I might have fixed this in my custom image by returning an error (return {"error": "ComfyUI server unavailable - failed to start or connect after maximum retries"}) in my runpod handler instead of doing os.exit(1) it's probably still happening, but I'm not feeling the impact nearly as much

Did you find this page helpful?