Workers stuck at "Running" indefinitely until removed by hand
It was all good for a month or so, but lately (the last 3-4 days) many of my workers started randomly getting stuck at "Running" with errors like this: (this is worker qv6a8l4769xx3q). They keep running like this, accruing uptime indefinitely and seemingly affecting the queue (I noticed many jobs waiting for minutes despite having 5 idle workers ready)

the image itself is nothing special, and it worked without problems before; no changes were made to the image (I even increased the required disk size just in case)
after a while I got this in the worker's log, but it was still shown as idle:
it is happening again: one request in queue and one running worker, doing nothing

nothing in the log this time even

It keeps happening to new workers for no apparent reason


Also getting many of these:
absolutely no changes were made on my side before it all started happening
do I pay for this as well?

Can you email help@Runpod.io? We will get all of this sorted and I'll make sure they refund you for everything you've spent on this.
just sent an e-mail with the same subject as the topic here. Please look into it, as it affects my production service severely
(ticket # 23140)
hi @Dj , I'm getting the same thing too with my serverless endpoint
- It started happening more after I updated my base image to pytorch/pytorch:2.8.0-cuda12.9-cudnn9-runtime (to get compatibility with 5090s)
- it's happening from time to time, and usually it works without issues
- I do have a filter set to only allow CUDA 12.9 in my endpoint settings
probably the worst thing for me is that it gets stuck in "Running" - the worker process doesn't see that comfy died and keeps waiting and waiting. I guess that's something I can look into myself
but why does that even happen in the first place? is the 12.9 tag inaccurate on some machines? is there more nuance to this?
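For the detection part, something like this is roughly what I have in mind: a bounded liveness probe instead of waiting forever (just a sketch, assuming ComfyUI listens on 127.0.0.1:8188 and that /system_stats is a reasonable endpoint to poll; the names and retry numbers are mine):
```python
import time
import requests

COMFY_URL = "http://127.0.0.1:8188"   # where my ComfyUI instance listens (assumption)
MAX_RETRIES = 30                       # give up instead of waiting forever
RETRY_DELAY = 2                        # seconds between probes

def comfy_is_up() -> bool:
    """True if the ComfyUI HTTP API answers at all."""
    try:
        return requests.get(f"{COMFY_URL}/system_stats", timeout=5).status_code == 200
    except requests.RequestException:
        return False

def wait_for_comfy() -> bool:
    """Bounded wait: return False once retries are exhausted so the handler can fail the job."""
    for _ in range(MAX_RETRIES):
        if comfy_is_up():
            return True
        time.sleep(RETRY_DELAY)
    return False
```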
I think I might have fixed this in my custom image by returning an error (return {"error": "ComfyUI server unavailable - failed to start or connect after maximum retries"}) in my runpod handler instead of doing os.exit(1)
it's probably still happening, but I'm not feeling the impact nearly as much
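For anyone hitting the same thing, this is roughly the shape of the handler change (a minimal sketch, not my exact code; run_comfy_workflow is a placeholder for whatever actually submits the workflow, and the probe is simplified):
```python
import runpod
import requests

COMFY_URL = "http://127.0.0.1:8188"  # assumption: local ComfyUI port

def comfy_reachable() -> bool:
    """Quick probe; False means comfy died or never came up."""
    try:
        return requests.get(f"{COMFY_URL}/system_stats", timeout=5).status_code == 200
    except requests.RequestException:
        return False

def handler(job):
    # Returning an error payload lets RunPod fail the job and release the worker,
    # instead of the process exiting and the worker sitting in "Running".
    if not comfy_reachable():
        return {"error": "ComfyUI server unavailable - failed to start or connect after maximum retries"}
    # run_comfy_workflow is a placeholder for the code that runs the actual job
    return run_comfy_workflow(job["input"])

runpod.serverless.start({"handler": handler})
```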