Requests stuck in queue

Hi, I am having issues with my serverless deployment: tasks are stuck in the queue for 6-10 minutes while there are idle workers (screenshot 1). I believe the issue is with how the container is started, not with the image itself. This image had been running for months without issues, and I tried an assortment of older images (which also all worked fine previously) to double-check; the issue persists with all of them.

What happens: a task arrives with status IN_QUEUE, and the list of requests shows it with an idle time of X minutes and a runtime of 0. In the list of workers, the first worker is shown to have picked up the task and is in the 'running' state. If we open that worker's telemetry, it shows zero VRAM, CPU, etc. usage, and the runtime restarts from 0s every time we open the telemetry tab. I believe no telemetry data is available because the worker is not really running. If we try to connect via SSH, we get 'connection rejected' (for the same reason, I believe), and the logs are empty too. For context, I also ship logs to Grafana Cloud, and it receives nothing when the problem reproduces, so this is not just a RunPod logs UI issue.

If we kill the worker via the 'trash' icon, the UI shows the task reassigned to another worker, and then there is a 50/50 chance that:
1. The task was really picked up and the worker runs (we can SSH, and telemetry data and logs are shown).
2. The task was NOT picked up, and the 'running' worker shows the same symptoms as above.

The only reliable fix I have found is to scale the endpoint to 0 workers and then back to X workers. Obviously this does not work for production, so we are considering other options. Once workers are initialized, they process requests normally as long as they stay warm. The problem reappears after the first cold start of a worker, excluding the first start after the worker is created (so maybe there is an issue reloading the image from cache?).
[screenshot attachment]
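To confirm the IN_QUEUE stall independently of the dashboard, a minimal sketch using the RunPod Python SDK to submit a job and poll its status is shown below. The API key, endpoint ID, and input payload are placeholders, and the exact payload shape depends on your rp_handler; treat this as a sketch, not the reporter's setup.

```python
import time

import runpod

# Placeholders: substitute your own API key and serverless endpoint ID.
runpod.api_key = "YOUR_RUNPOD_API_KEY"
endpoint = runpod.Endpoint("YOUR_ENDPOINT_ID")

# Submit an async job. The payload shape depends on your rp_handler;
# {"prompt": ...} here is just an example input.
job = endpoint.run({"prompt": "healthcheck"})

# Poll the job status. A healthy worker should leave IN_QUEUE within
# seconds, not 6-10 minutes.
start = time.time()
while True:
    status = job.status()
    print(f"[{time.time() - start:6.1f}s] status={status}", flush=True)
    if status not in ("IN_QUEUE", "IN_PROGRESS"):
        break
    time.sleep(10)

print("final output:", job.output())
```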
12 Replies
DIRECTcut ▲ (OP) · 2w ago
I found this thread that recommends a user with a similar problem rebuild their image, but:
1. This very image worked previously, and so did older images from that repository.
2. The image does have the runpod package, does have a valid entrypoint, and is always healthy whenever it manages to start (so it's not a container crash, but a failure to start at all).

The image is based on nvidia/cuda:12.6.3-cudnn-runtime-ubuntu22.04, is ~12 GB in size, and is stored in a private Docker Hub repo (Docker Hub credentials are added to RunPod, obviously). It is derived from runpod-comfy; the entrypoint and the wrapper code in rp_handler are the same (some functionality is added to rp_handler, but that is irrelevant here, since the worker fails to start at all). AFAIK, when the entrypoint code crashes, the worker is displayed as 'unhealthy' in the UI; I've seen that before, but this is not the case here: the worker never starts.

Any help is greatly appreciated, cheers
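For reference, the standard RunPod serverless entrypoint pattern being described looks roughly like the sketch below. This is not the actual rp_handler from the image; the startup prints are only there to make "the container never started" and "the handler crashed" distinguishable in the worker logs, which is exactly the distinction being argued here.

```python
# Minimal sketch of a RunPod serverless handler (not the actual rp_handler
# from the image). If even the first print never reaches the logs, the
# container process was never launched at all.
import runpod

print("rp_handler: container process started", flush=True)


def handler(job):
    # job["input"] carries whatever the client sent to /run or /runsync.
    job_input = job.get("input", {})
    print(f"rp_handler: received job {job.get('id')}", flush=True)
    # ... real inference would go here ...
    return {"echo": job_input}


if __name__ == "__main__":
    print("rp_handler: calling runpod.serverless.start()", flush=True)
    runpod.serverless.start({"handler": handler})
```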
DIRECTcut ▲ (OP) · 2w ago
Here are screenshots of the problem
[6 screenshot attachments]
DIRECTcut ▲ (OP) · 2w ago
OK, this is weird. I submitted a request and it got stuck in the queue. One of the workers (worker 1) showed as running, but with no telemetry and no SSH connection. I then ran another request, and it got picked up by a different worker (worker 2); this time the worker was indeed running and I was able to connect over SSH. When worker 2 completed that second request, it picked up the first (stuck) request and processed it without issues.

Worker 1 is still shown as running, even though no tasks are in the queue. Am I getting billed for this too?
[2 screenshot attachments]
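One way to cross-check a "phantom running worker" outside the UI is the serverless health route, sketched below with the requests library. ENDPOINT_ID is a placeholder, and the response field names follow the public API docs at the time of writing, so treat the exact shape as an assumption.

```python
# Sketch: query the serverless /health route to cross-check what the UI shows.
# A worker reported under "running" while all job counters sit at zero for
# minutes matches the phantom worker described above.
import os

import requests

ENDPOINT_ID = "YOUR_ENDPOINT_ID"          # placeholder
API_KEY = os.environ["RUNPOD_API_KEY"]    # read from the environment

resp = requests.get(
    f"https://api.runpod.ai/v2/{ENDPOINT_ID}/health",
    headers={"Authorization": f"Bearer {API_KEY}"},
    timeout=30,
)
resp.raise_for_status()
health = resp.json()

# Expected shape (per the public docs; may change):
# {"jobs": {"completed": ..., "failed": ..., "inProgress": ..., "inQueue": ..., "retried": ...},
#  "workers": {"idle": ..., "running": ...}}
print("jobs:   ", health.get("jobs"))
print("workers:", health.get("workers"))
```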
Jason · 2w ago
I hope you can get this resolved soon; please update me here when it's resolved. When a worker is running, it is billed, but a worker shouldn't stay running for long without jobs (only for the idle timeout).
DIRECTcut ▲ (OP) · 2w ago
So far no solution in sight… Can we escalate this for someone on the team to take a look? I'm considering switching providers because of this, which is a shame, since overall the RunPod experience has been solid so far :/
Jason · 2w ago
oh you haven't created any ticket?
Poddy · 2w ago
@DIRECTcut ▲
Escalated To Zendesk
The thread has been escalated to Zendesk!
Ticket ID: #16930
Jason · 2w ago
But the screenshot shows only 1 worker running; is that your idle timeout?
DIRECTcut ▲ (OP) · 2w ago
Yeah, this “running” worker is not really running: no telemetry is reported and no SSH connection is possible. I wasn't able to record a video, and the screenshots are confusing, sorry. I will try one more thing: pinning the CUDA version for the workers (maybe some libs clash with the CUDA version on the worker, and different workers can have different CUDA versions, hence the perceived randomness of the problem).
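If the CUDA-mismatch theory is worth testing, something like the sketch below, called once at the top of rp_handler before runpod.serverless.start(), would record which driver each worker actually lands on. The nvidia-smi query fields and the optional torch check are assumptions about what the image ships, not the reporter's code.

```python
# Sketch: log the GPU driver / CUDA runtime a worker actually gets, to test
# the "different workers have different CUDA versions" theory. Intended to be
# called once at handler startup.
import shutil
import subprocess


def log_cuda_environment() -> None:
    if shutil.which("nvidia-smi"):
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=name,driver_version",
             "--format=csv,noheader"],
            capture_output=True, text=True, check=False,
        )
        print(f"nvidia-smi: {out.stdout.strip() or out.stderr.strip()}", flush=True)
    else:
        print("nvidia-smi not found on PATH", flush=True)

    try:
        import torch  # only if the image ships PyTorch
        print(f"torch {torch.__version__}, built for CUDA {torch.version.cuda}, "
              f"cuda available: {torch.cuda.is_available()}", flush=True)
    except ImportError:
        print("torch not installed; skipping runtime CUDA check", flush=True)


if __name__ == "__main__":
    log_cuda_environment()
```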
Jason · 2w ago
Yeah, having no logs to go on is tiring.
자베르 · 6d ago
I have the same issue
