Requests stuck in queue
Hi
I am having issues with my serverless deployment: tasks are stuck in the queue for 6-10 minutes while there are idle workers (screenshot 1).
I believe the issue is with how the container is started, not with the image itself.
This image had been running for months without issues. To double-check, I tried an assortment of older images (which also all worked fine),
and the issue persists.
A task arrives with status IN_QUEUE; the list of requests shows the request with an idle time of X minutes and a runtime of 0.
In the list of workers, the first worker is shown to have picked up the task and is in the 'running' state.
If we open the worker telemetry, it shows empty usage of VRAM, CPU, etc., and the runtime starts at 0s every time we open the telemetry tab.
I believe no telemetry data is available because the worker is not really running. If we try to connect via SSH, we get 'connection rejected'
(for the same reason, I believe). Logs are empty too. For context, I collect logs in Grafana Cloud, and it also receives nothing when the problem reproduces
(so it's not just a RunPod logs UI issue).
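To quantify how long requests sit in the queue (and to confirm it is not a UI-only artifact), I poll the serverless status route from a small script. A minimal sketch, assuming the standard `api.runpod.ai/v2/{endpoint_id}/status/{job_id}` route; `ENDPOINT_ID`, `API_KEY`, and the 120s "stuck" threshold are placeholders of mine:

```python
import json
import time
import urllib.request

API_BASE = "https://api.runpod.ai/v2"  # RunPod serverless API base (assumed)


def is_stuck(status: str, queued_seconds: float, threshold: float = 120.0) -> bool:
    """Treat a request as 'stuck' if it is still IN_QUEUE past the threshold."""
    return status == "IN_QUEUE" and queued_seconds >= threshold


def poll_job(endpoint_id: str, job_id: str, api_key: str, interval: float = 15.0):
    """Poll /status/{job_id} and print how long the job has been waiting."""
    url = f"{API_BASE}/{endpoint_id}/status/{job_id}"
    req = urllib.request.Request(url, headers={"Authorization": f"Bearer {api_key}"})
    start = time.monotonic()
    while True:
        with urllib.request.urlopen(req) as resp:
            payload = json.load(resp)
        elapsed = time.monotonic() - start
        status = payload.get("status")
        print(f"{elapsed:6.0f}s status={status}")
        if status not in ("IN_QUEUE", "IN_PROGRESS"):
            return payload  # terminal state: COMPLETED / FAILED / CANCELLED
        if is_stuck(status, elapsed):
            print("request looks stuck: still IN_QUEUE past threshold")
        time.sleep(interval)
```

With this running I can see requests sitting in IN_QUEUE for the full 6-10 minutes with runtime never leaving 0, matching what the UI shows.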
If we kill the worker via the 'trash' icon, the task appears in the UI to have been reassigned to another worker, in which case there is a 50/50 chance that:
- The task was really picked up and the worker runs (we can SSH in; telemetry data and logs are shown)
- The task was NOT picked up, and the 'running' worker displays the symptoms described above

This is not acceptable for production, so we are considering other options. After workers are initialized, they actually process requests as long as they stay warm. The problem reappears after the first cold start of a worker, excluding the very first start after the worker is created (so maybe there is an issue reloading the image from cache?)
