Requests stuck in queue
Hi
I am having issues with my serverless deployment: tasks are stuck in the queue for 6-10 minutes while there are idle workers (screenshot 1).
I believe the issue is with how the container is started, not with the image itself.
This image had been running for months without issues. To double-check, I tried an assortment of older images (which had also all worked fine),
and the issue persists.
A task arrives with status IN_QUEUE; the list of requests shows the request with an idle time of X minutes and a runtime of 0.
In the list of workers, the first worker is shown as having picked up the task and is in the 'running' state.
If we open the worker telemetry, it shows no VRAM, CPU, etc. usage, and the runtime restarts from 0s every time we open the telemetry tab.
I believe no telemetry data is available because the worker is not really running. If we try to connect via SSH, we get 'connection rejected'
(for the same reason, I believe). Logs are empty too. For context: I have logs collected in Grafana Cloud, and it is not receiving logs either when the problem reproduces
(so it's not just a RunPod logs UI issue).
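In case it helps with reproducing, the same stuck state should also be visible outside the UI via the standard serverless /status and /health routes; here is a rough sketch of what polling that looks like (endpoint and job IDs are placeholders):

```python
import os
import time
import requests

API_KEY = os.environ["RUNPOD_API_KEY"]
ENDPOINT_ID = "my-endpoint-id"   # placeholder
JOB_ID = "stuck-job-id"          # placeholder: the id returned by /run
HEADERS = {"Authorization": f"Bearer {API_KEY}"}
BASE = f"https://api.runpod.ai/v2/{ENDPOINT_ID}"

while True:
    status = requests.get(f"{BASE}/status/{JOB_ID}", headers=HEADERS, timeout=10).json()
    health = requests.get(f"{BASE}/health", headers=HEADERS, timeout=10).json()
    # Symptom: the job stays IN_QUEUE while /health still reports a worker as "running".
    print(status.get("status"), health.get("workers"))
    if status.get("status") not in ("IN_QUEUE", "IN_PROGRESS"):
        break
    time.sleep(15)
```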
If we kill the worker via the 'trash' icon, the task appears in the UI to have been reassigned to another worker, in which case there is a 50/50 chance that:
1. The task was really picked up and the worker runs (we can SSH in, and telemetry data and logs are shown)
2. The task was NOT picked up and the 'running' worker displays the symptoms described above
The only reliable workaround I have found is to scale the endpoint to 0 workers and then back to X workers. Obviously this does not work
for production, so we are considering other options. After workers are initialized, they actually process jobs as long as they stay warm. The problem reappears after the first cold start of a worker, excluding the very first start after the worker is created (so maybe there is an issue reloading the image from cache?)
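For now, the scale-down/scale-up workaround can at least be scripted. A rough sketch, assuming the GraphQL saveEndpoint mutation accepts a workersMax update for an existing endpoint id (I have not verified the exact schema, so treat the mutation shape as an assumption):

```python
import os
import time
import requests

API_KEY = os.environ["RUNPOD_API_KEY"]
ENDPOINT_ID = "my-endpoint-id"  # placeholder
GRAPHQL_URL = f"https://api.runpod.io/graphql?api_key={API_KEY}"

def set_max_workers(workers_max: int) -> None:
    # Mutation shape is an assumption (an update may also require name/templateId/gpuIds);
    # check the RunPod GraphQL docs before relying on this.
    query = f"""
    mutation {{
      saveEndpoint(input: {{ id: "{ENDPOINT_ID}", workersMax: {workers_max} }}) {{
        id
        workersMax
      }}
    }}
    """
    resp = requests.post(GRAPHQL_URL, json={"query": query}, timeout=30)
    resp.raise_for_status()
    print(resp.json())

# Scale to zero, give the stuck workers time to be torn down, then scale back up.
set_max_workers(0)
time.sleep(60)
set_max_workers(3)  # back to the normal max
```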

12 Replies
I found this thread, which recommends that a user with a similar problem rebuild their image, but:
1. This very image worked previously, and so did older images from that repository
2. The image does have the runpod package, does have a valid entrypoint, and is always healthy whenever it is able to start (so it's not a container crash, but a failure to start at all)
The image is based on nvidia/cuda:12.6.3-cudnn-runtime-ubuntu22.04, is ~12 GB, and is stored in a private Docker Hub repo (DH keys added to RunPod, obviously). The image is based on runpod-comfy; the entrypoint and the wrapper code in rp_handler
are the same (some functionality is added to rp_handler, but that is irrelevant, as the worker fails to start at all).
AFAIK, when the entrypoint code crashes, the worker is displayed as 'unhealthy' in the UI; I've seen that before, but that is not the case here: the worker never starts.
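For reference, the start-up path in rp_handler follows the standard runpod serverless pattern; a simplified sketch (the real handler wraps the ComfyUI workflow on top of this, which I've omitted):

```python
import runpod

def handler(job):
    """Receives the job dict from the queue; the real handler runs the ComfyUI workflow here."""
    job_input = job["input"]
    # ... ComfyUI workflow execution omitted ...
    return {"status": "ok", "echo": job_input}

# If the container gets this far, the worker registers and starts polling for jobs;
# in the failure mode described above it never even reaches this point.
runpod.serverless.start({"handler": handler})
```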
Any help is greatly appreciated, cheers
Here are screenshots of the problem:

[screenshots]

OK, this is weird. I submitted a request and it got stuck in the queue. One of the workers (worker 1) showed as running, but with no telemetry and no SSH connection.
I ran another request, and it got picked up by another worker (worker 2); this time the worker was indeed running and I was able to connect with SSH. When worker 2 completed that second request, it picked up the first (stuck) request and processed it without issues.
Worker 1 is still shown as running, even though no tasks are in the queue. Am I getting billed for this too?


I hope you can get this resolved soon; please update me here when it's resolved.
When a worker is running, it is billed.
But a worker won't stay running for that long without jobs (it only stays up for the idle timeout).
So far no solution in sight… Can we escalate this so someone on the team can take a look? I'm considering switching providers because of this, which is a shame, since overall the RunPod experience has been solid so far :/
Oh, you haven't created a ticket?
@DIRECTcut ▲
Escalated To Zendesk
The thread has been escalated to Zendesk!
Ticket ID: #16930
But the screenshot shows only 1 worker running; is that your idle timeout?
Yeah, this 'running' worker is not really running, because no telemetry is reported and there's no SSH connection. I wasn't able to record a video, and the screenshots are confusing, sorry.
I will try one more thing: pinning the CUDA version for the worker (maybe some libs clash with the CUDA version on the worker, and different workers can have different CUDA versions, hence the perceived randomness of the problem).
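To check that theory, something like this at the top of rp_handler would show which driver/CUDA combination each worker actually gets when it does start (rough sketch; assumes torch is present in the image, as it is with runpod-comfy):

```python
import subprocess

def log_cuda_environment() -> None:
    # Log the driver and GPU name reported by the worker.
    try:
        smi = subprocess.run(
            ["nvidia-smi", "--query-gpu=driver_version,name", "--format=csv,noheader"],
            capture_output=True, text=True, timeout=10,
        )
        print("nvidia-smi:", smi.stdout.strip() or smi.stderr.strip())
    except FileNotFoundError:
        print("nvidia-smi not available in this worker")
    # Log the CUDA version torch was built against (torch is an assumption, see above).
    try:
        import torch
        print("torch CUDA:", torch.version.cuda, "available:", torch.cuda.is_available())
    except ImportError:
        print("torch not installed")

log_cuda_environment()
```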
Yeah, having no logs is tiring.
I have the same issue
Open a ticket
https://contact.runpod.io/hc/en-us/requests/new