Requests stuck in queue

Hi, I am having issues with my serverless deployment: tasks are stuck in the queue for 6-10 minutes while there are idle workers (screenshot 1). I believe the issue is with how the container is started, not with the image itself. This image had been running for months without issues, and I tried an assortment of older images (which also all worked fine previously) to double-check; the issue persists with all of them.

What happens: a task arrives with status IN_QUEUE, and the list of requests shows it with an idle time of X minutes and a runtime of 0. In the list of workers, the first worker is shown to have picked up the task and is in the 'running' state. If we open that worker's telemetry, it shows zero VRAM, CPU, etc. usage, and the runtime restarts from 0s every time we open the telemetry tab. I believe no telemetry data is available because the worker is not really running. If we try to connect via SSH, we get 'connection rejected' (for the same reason, I believe), and the logs are empty too. For context, I also ship logs to Grafana Cloud, and it receives nothing when the problem reproduces, so this is not just a RunPod logs UI issue.

If we kill the worker via the 'trash' icon, the UI shows the task reassigned to another worker, and then there is a 50/50 chance that:
1. The task was really picked up and the worker runs (we can SSH, and telemetry data and logs are shown).
2. The task was NOT picked up, and the 'running' worker shows the same symptoms as above.

The only reliable fix I have found is to scale the endpoint to 0 workers and then back to X workers. Obviously this does not work for production, so we are considering other options. Once workers are initialized, they process requests normally as long as they stay warm. The problem reappears after the first cold start of a worker, excluding the first start after the worker is created (so maybe there is an issue reloading the image from cache?).
[screenshot attachment]
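To confirm the IN_QUEUE stall independently of the dashboard, a minimal sketch using the RunPod Python SDK to submit a job and poll its status is shown below. The API key, endpoint ID, and input payload are placeholders, and the exact payload shape depends on your rp_handler; treat this as a sketch, not the reporter's setup.

```python
import time

import runpod

# Placeholders: substitute your own API key and serverless endpoint ID.
runpod.api_key = "YOUR_RUNPOD_API_KEY"
endpoint = runpod.Endpoint("YOUR_ENDPOINT_ID")

# Submit an async job. The payload shape depends on your rp_handler;
# {"prompt": ...} here is just an example input.
job = endpoint.run({"prompt": "healthcheck"})

# Poll the job status. A healthy worker should leave IN_QUEUE within
# seconds, not 6-10 minutes.
start = time.time()
while True:
    status = job.status()
    print(f"[{time.time() - start:6.1f}s] status={status}", flush=True)
    if status not in ("IN_QUEUE", "IN_PROGRESS"):
        break
    time.sleep(10)

print("final output:", job.output())
```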
12 Replies
DIRECTcut ▲ (OP) · 2w ago
I found this thread that recommends a user with a similar problem rebuild their image, but:
1. This very image worked previously, and so did older images from that repository.
2. The image does have the runpod package, does have a valid entrypoint, and is always healthy whenever it manages to start (so it's not a container crash, but a failure to start at all).

The image is based on nvidia/cuda:12.6.3-cudnn-runtime-ubuntu22.04, is ~12 GB in size, and is stored in a private Docker Hub repo (Docker Hub credentials are added to RunPod, obviously). It is derived from runpod-comfy; the entrypoint and the wrapper code in rp_handler are the same (some functionality is added to rp_handler, but that is irrelevant here, since the worker fails to start at all). AFAIK, when the entrypoint code crashes, the worker is displayed as 'unhealthy' in the UI; I've seen that before, but this is not the case here: the worker never starts.

Any help is greatly appreciated, cheers
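For reference, the standard RunPod serverless entrypoint pattern being described looks roughly like the sketch below. This is not the actual rp_handler from the image; the startup prints are only there to make "the container never started" and "the handler crashed" distinguishable in the worker logs, which is exactly the distinction being argued here.

```python
# Minimal sketch of a RunPod serverless handler (not the actual rp_handler
# from the image). If even the first print never reaches the logs, the
# container process was never launched at all.
import runpod

print("rp_handler: container process started", flush=True)


def handler(job):
    # job["input"] carries whatever the client sent to /run or /runsync.
    job_input = job.get("input", {})
    print(f"rp_handler: received job {job.get('id')}", flush=True)
    # ... real inference would go here ...
    return {"echo": job_input}


if __name__ == "__main__":
    print("rp_handler: calling runpod.serverless.start()", flush=True)
    runpod.serverless.start({"handler": handler})
```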
DIRECTcut ▲ (OP) · 2w ago
Here are screenshots of the problem
[6 screenshot attachments]
DIRECTcut ▲ (OP) · 2w ago
OK, this is weird. I submitted a request and it got stuck in the queue. One of the workers (worker 1) showed as running, but with no telemetry and no SSH connection. I then ran another request, and it got picked up by a different worker (worker 2); this time the worker was indeed running and I was able to connect over SSH. When worker 2 completed that second request, it picked up the first (stuck) request and processed it without issues.

Worker 1 is still shown as running, even though no tasks are in the queue. Am I getting billed for this too?
[2 screenshot attachments]
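One way to cross-check a "phantom running worker" outside the UI is the serverless health route, sketched below with the requests library. ENDPOINT_ID is a placeholder, and the response field names follow the public API docs at the time of writing, so treat the exact shape as an assumption.

```python
# Sketch: query the serverless /health route to cross-check what the UI shows.
# A worker reported under "running" while all job counters sit at zero for
# minutes matches the phantom worker described above.
import os

import requests

ENDPOINT_ID = "YOUR_ENDPOINT_ID"          # placeholder
API_KEY = os.environ["RUNPOD_API_KEY"]    # read from the environment

resp = requests.get(
    f"https://api.runpod.ai/v2/{ENDPOINT_ID}/health",
    headers={"Authorization": f"Bearer {API_KEY}"},
    timeout=30,
)
resp.raise_for_status()
health = resp.json()

# Expected shape (per the public docs; may change):
# {"jobs": {"completed": ..., "failed": ..., "inProgress": ..., "inQueue": ..., "retried": ...},
#  "workers": {"idle": ..., "running": ...}}
print("jobs:   ", health.get("jobs"))
print("workers:", health.get("workers"))
```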
Jason · 2w ago
I hope you can get this resolved soon; please update me here when it's resolved. When a worker is running, it is billed, but a worker shouldn't stay running for long without jobs (only for the idle timeout).
DIRECTcut ▲ (OP) · 2w ago
So far no solution in sight… Can we escalate this for someone on the team to take a look? I'm considering switching providers because of this, which is a shame, since overall the RunPod experience has been solid so far :/
Jason · 2w ago
oh you haven't created any ticket?
Poddy · 2w ago
@DIRECTcut ▲
Escalated To Zendesk
The thread has been escalated to Zendesk!
Ticket ID: #16930
Jason · 2w ago
But the screenshot shows only 1 worker running; is that your idle timeout?
DIRECTcut ▲ (OP) · 2w ago
Yeah, this “running” worker is not really running: no telemetry is reported and no SSH connection is possible. I wasn't able to record a video, and the screenshots are confusing, sorry. I will try one more thing: pinning the CUDA version for the workers (maybe some libs clash with the CUDA version on the worker, and different workers can have different CUDA versions, hence the perceived randomness of the problem).
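If the CUDA-mismatch theory is worth testing, something like the sketch below, called once at the top of rp_handler before runpod.serverless.start(), would record which driver each worker actually lands on. The nvidia-smi query fields and the optional torch check are assumptions about what the image ships, not the reporter's code.

```python
# Sketch: log the GPU driver / CUDA runtime a worker actually gets, to test
# the "different workers have different CUDA versions" theory. Intended to be
# called once at handler startup.
import shutil
import subprocess


def log_cuda_environment() -> None:
    if shutil.which("nvidia-smi"):
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=name,driver_version",
             "--format=csv,noheader"],
            capture_output=True, text=True, check=False,
        )
        print(f"nvidia-smi: {out.stdout.strip() or out.stderr.strip()}", flush=True)
    else:
        print("nvidia-smi not found on PATH", flush=True)

    try:
        import torch  # only if the image ships PyTorch
        print(f"torch {torch.__version__}, built for CUDA {torch.version.cuda}, "
              f"cuda available: {torch.cuda.is_available()}", flush=True)
    except ImportError:
        print("torch not installed; skipping runtime CUDA check", flush=True)


if __name__ == "__main__":
    log_cuda_environment()
```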
Jason · 2w ago
Yeah, having no logs to go on is tiring.
자베르 · 6d ago
I have the same issue
