Runpod•7mo ago

Queue Delay Time

What is currently a normal delay time? I remember that previously, it was normal to have delay times in milliseconds and container startups were near instant even on the coldest of cold starts. But lately, I have been observing queue delays I don't recognize, up to the point where even my vLLM image can fully initialize the engine on cold start and compute a full response in almost the same time RunPod takes just to start the container. The same goes for SDXL, although it is a bit better there. Does container image size affect it? But then why would it also happen on warm requests?

This makes it unthinkable to use. Creating a new endpoint or downgrading the RunPod SDK version didn't help. And overall there's nothing I could find that would allow me to influence this further. Plus as I said, it's not happening only on cold starts. The delay on a warm worker is even more extreme as it's sometimes longer than the execution time itself. (See screenshots)

*Please note that delay time in this case is truly only the queue delay, as I had to move the initialization (loading models etc.) into the handler where it's counted as execution because I wanted to allow users to change vLLM configuration per-cold start via request payload.
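
For readers unfamiliar with this setup, the pattern described above (initialization moved into the handler so it is billed as execution and can be configured per cold start from the payload) might look roughly like the sketch below. This is not the poster's actual code: `build_engine` and the `engine_config` payload key are hypothetical placeholders, and the only assumption about the platform is that the handler receives a job dict with an `input` field.

```python
import time

_ENGINE = None         # cached across warm requests on the same worker
_ENGINE_CONFIG = None  # the config the cached engine was built with


def build_engine(config):
    """Hypothetical stand-in for real engine construction (e.g. loading a model)."""
    return {"config": dict(config), "created_at": time.time()}


def handler(job):
    """Lazily initialise on the first request so the cost counts as execution time."""
    global _ENGINE, _ENGINE_CONFIG
    job_input = job.get("input") or {}
    config = job_input.get("engine_config", {})  # hypothetical payload key

    cold = _ENGINE is None
    if cold:
        # Only the first request on a fresh worker pays for initialisation;
        # later requests on the same worker reuse the cached engine and its config.
        _ENGINE = build_engine(config)
        _ENGINE_CONFIG = dict(config)

    return {
        "cold_start": cold,
        "engine_config": _ENGINE_CONFIG,
        "echo": job_input.get("prompt"),
    }


# With the RunPod Python SDK, a handler like this is normally registered with:
#   runpod.serverless.start({"handler": handler})
```

With this layout, whatever the platform reports as delay time excludes model loading, which is why the numbers in this thread are described as pure queue delay.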

Any help would be appreciated since this is a major deal breaker.

3WaDOP•5/28/25, 4:12 AM

vLLM coldstarts:

3WaDOP•5/28/25, 4:13 AM

SDXL (Fooocus) coldstarts:

3WaDOP•5/28/25, 4:14 AM

vLLM warm request:

Jason•5/28/25, 5:33 AM

which scaling type did you use? in the endpoint

Jason•5/28/25, 5:33 AM

both results in the same delay?

3WaDOP•5/28/25, 1:11 PM

I normally use request count set to 1 as recommended in the docs, but it happens on queue delay too.

Jason•5/28/25, 1:13 PM

do you want to open a support ticket then?

Jason•5/28/25, 1:13 PM

container size, how big are you talking about

3WaDOP•5/28/25, 1:17 PM

What queue delay times do you observe on your endpoints? I am trying to understand if it's normal or if I am doing something wrong. I am using ~20GB images with models baked in. I thought it might affect this because of loading from disk, but then why would it happen on warm requests too?

Jason•5/28/25, 1:19 PM

hmm so in total the image size?

Jason•5/28/25, 1:19 PM

maybe it's best to create a ticket for staff to check your endpoint

3WaDOP•5/28/25, 1:21 PM

Maybe. But as I said it's happening on fresh new endpoints too so perhaps they would have to check the whole account

Jason•5/28/25, 1:23 PM

yeah sure, maybe they can find something

3WaDOP•5/28/25, 1:24 PM

I will try storing the models in network volume today to see if there's a change. But it would really help to know if others observe this too

Jason•5/28/25, 1:25 PM

Which Docker image did you use for the base, btw?

Jason•5/28/25, 1:29 PM

i really dont use serverless these days btw so i couldnt compare on production endpoints

Jason•6/7/25, 1:35 AM

Have you created any tickets regarding this issue?

PoddyAPP•6/7/25, 1:35 AM

@3WaD

Escalated To Zendesk

The thread has been escalated to Zendesk!

Ticket ID: #18647

3WaDOP•6/7/25, 12:29 PM

Not yet. I first wanted to know if other users have this problem too.

Jason•6/7/25, 1:48 PM

I think it's better to report it first so that they know this is a problem you're experiencing

River•6/7/25, 2:25 PM

So, the answer here is that we do caching on a best-case basis, and if there is a high amount of demand from other customers, unfortunately, we will have to clear the cache more often

River•6/7/25, 2:25 PM

causing higher delays

3WaDOP•6/7/25, 2:39 PM

So, the delay is caused on the job scheduling level in your system and has to be cached in order to be fast? There's nothing I can do on my side to reduce it? How many requests does it usually take to cache it? Because as stated, it's happening also on warm worker requests, right after the previous ones.

River•6/7/25, 2:48 PM

The general rule is that smaller models load faster, and it isn't about the number of requests but the delay between the requests

River•6/7/25, 2:49 PM

Can I ask how much delay is there between requests on your end?

3WaDOP•6/7/25, 3:03 PM

The usage is inconsistent and not super frequent. Especially when in development. That's why we use serverless in the first place, right?

3WaDOP•6/8/25, 2:25 PM

@River To make it clear, based on the email you've sent me, these replies, and the RunPod AI (good thing you have at least that), the recommendation is only the standard - to pick different GPUs and use them more frequently?

Shouldn't there be something like "delay-based priority GPU selection" that would initialize workers on the GPUs with the highest availability and lowest scheduling time on the endpoint, then? Why are we supposed to observe the availability (which we can't even do) and manage this ourselves on a serverless platform? The traffic of others on a specific GPU you picked making your product unavailable doesn't sound very production-ready.

eric.mattmann•6/10/25, 3:57 AM

Indeed… we have started looking elsewhere

3WaDOP•6/11/25, 4:14 AM

@River I've created a tiny testing container with just the minimal working handler to exclude the size and software of my custom images. It's still happening. On all data centres, on multiple GPU types (I even tested the recommended L40s stated in the email), on all request types (sync, async, OpenAI), and on a fresh new endpoint. On a cold start it takes 6-8s+ just to start the container. Even when you spam the warm worker with requests, the best-case delay is like 10x what it used to be in the past. This state is nowhere near the marketed <250ms (which was true in the past, I remember the same endpoint, with the same container performing like that), and unfortunately unusable for low-latency tasks.

Because other users share the same problem and are even leaving the platform because of this, perhaps it deserves a bit of attention.

Dj•6/11/25, 4:17 PM

@yhlong00000 Wanna take a look at this?

3WaDOP•6/11/25, 7:59 PM

So far, I've been able to achieve good delay times only when spamming active workers. But even in that case, the first request is still bad, which doesn't make much sense - the container should already run and be ready, that's what the user pays for.

3WaDOP•6/11/25, 8:00 PM

This is the delay when spamming non-active warm workers:

3WaDOP•6/11/25, 8:01 PM

Also, one could ask why we care about the cold-start/first request delays so much when spamming a lot of requests seems to be usable. Serverless is supposed to scale to 0 and its core idea is for users/products with unpredictable usage patterns. Very often, especially when RunPod experiences high traffic, the workers don't stay warm for a long time, sometimes they even do only one request and shift away from the endpoint. This can result in most of the requests being cold-starts, and having up to 10s delay time on top of the actual cold-start time of the container is simply not usable for production.
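
The point about cold starts dominating can be made concrete with a small back-of-the-envelope calculation. This is only a sketch: the cold fractions below are hypothetical inputs, and the delay values are rough figures taken from the reports in this thread (about 6s in the queue on a first request, about 1s on a warm one), not measurements of any specific endpoint.

```python
def expected_delay(cold_fraction, cold_delay_s, warm_delay_s):
    """Average queue delay when a given fraction of requests hit a fresh worker."""
    if not 0.0 <= cold_fraction <= 1.0:
        raise ValueError("cold_fraction must be between 0 and 1")
    return cold_fraction * cold_delay_s + (1.0 - cold_fraction) * warm_delay_s


# Hypothetical traffic mixes; delays are rough figures quoted in the thread.
for fraction in (0.0, 0.5, 1.0):
    avg = expected_delay(fraction, cold_delay_s=6.0, warm_delay_s=1.0)
    print(f"{fraction:.0%} cold requests: average delay {avg:.2f}s")
```

The average grows linearly with the share of first-request hits, so frequent worker shifts matter more than the best-case warm delay.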

yhlong00000•6/12/25, 12:46 PM

Hey, I totally understand your frustration, let me try to break it down a bit.

When we talk about cold starts, there’s actually quite a bit happening behind the scenes. After you send the initial request, our system puts it into a queue and signals a worker (somewhere in a data center) to wake up. That process includes starting the container, loading your model from disk into GPU VRAM, and initializing everything. Once the worker is ready, it pulls the request from the queue and all these contribute to the delay time. So, the larger your image or model, the longer the cold start will be. There’s no real magic to make that happen in just a few hundred milliseconds.

For subsequent requests, we use something called Flashboot (or “warm workers” as you’ve probably seen). That helps reduce start time since we can cache things on our end. But as you noticed, it’s not guaranteed, it’s more of a best-effort system. It depends on factors like:
• How popular the GPU you’re using is
• How frequent your traffic is

In general, if you’re on a less popular GPU and sending frequent requests, your worker has a higher chance of being cached, and you could see delay times drop to around 0.25–2 seconds.

The goal of serverless is to achieve speed and low cost, but the reality is: if you want fast performance and are running large models, you’ll likely need to keep active workers around or raise the idle timeout, and that comes with higher costs. Basically: fast, big model, and cheap, you can usually only pick two.

Jason•6/12/25, 12:49 PM

I think what they are trying to say is the delay from the job queue

yhlong00000•6/12/25, 12:52 PM

The delay time you see in the UI or output reflects the time from when you send the request to when the worker is fully awake and picks it up from the queue.
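
To compare delay against execution across many requests rather than from single screenshots, the timing fields of finished jobs can be summarized as in the sketch below. It assumes each result is a dict carrying `delayTime` and `executionTime` in milliseconds, which is how RunPod job responses usually report them; treat the field names as an assumption and check them against your own output. The sample values are made up for illustration.

```python
from statistics import median


def summarize_delays(responses):
    """Summarise queue delay across job results (all times in milliseconds)."""
    # Ignore results that lack either timing field (e.g. jobs still in the queue).
    timed = [r for r in responses if "delayTime" in r and "executionTime" in r]
    if not timed:
        return None
    delays = [r["delayTime"] for r in timed]
    return {
        "count": len(timed),
        "median_delay_ms": median(delays),
        "max_delay_ms": max(delays),
        # Jobs that waited in the queue longer than they actually ran.
        "delay_exceeds_execution": sum(
            1 for r in timed if r["delayTime"] > r["executionTime"]
        ),
    }


# Made-up sample values, for illustration only.
sample = [
    {"delayTime": 6500, "executionTime": 40},
    {"delayTime": 1100, "executionTime": 35},
    {"delayTime": 100, "executionTime": 300},
]
print(summarize_delays(sample))
```

Collecting these per worker and per GPU type would show whether the high values cluster on first requests to a new worker.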

Jason•6/12/25, 12:53 PM

They also show this discrepancy when spamming requests with active workers vs no active workers, oh wait I think it makes sense then

3WaDOP•6/12/25, 3:55 PM

*Sigh. It really feels that all the text I am writing doesn't matter.

Container cold-start time which includes things like model loading or initialization of the software running on it, and the job queue delay caused by the platform are two different things. You can refer to them as one but there's a huge difference - The user has control over the first. And if you read the previous message before your reply, I showed how I made a minimal handler based on a slim Python image, with no models, no software running on it, and just a dummy response executing in a few milliseconds (see the screenshots), yet it's behaving the same. So stating the job queue delay is caused by a model loading time, or even that the job is started only after the software on the container is initialized seems to be simply wrong.

What is not addressed in these generic responses, is why the delay can be 0.25 or as high as 3s I observed on the flashboot warm request. Or the main issue I talked about - why every first request to a new worker stays 6+ seconds in the queue. This can happen even when spamming/using the endpoint very actively or even when having active workers. A single worker can simply shift out (it happens a lot) and once the request is sent to a different one, you get this behaviour. It wasn't happening in the past, and it's not happening on competing platforms, that's why users switch to them.

I understand you have a lot of issues to solve and a lot of work to do. But could you please consider treating this issue as one coming from a developer who spends his free time and his own money to make community implementations and promote your platform? It's very hard to continue doing it otherwise.

Xeverian•6/12/25, 4:02 PM

could you name other platforms that have serverless endpoints that run requests from custom docker container? I tried looking for them, but have a hard time finding one

3WaDOP•6/12/25, 4:09 PM

I would rather see this resolved than promoting the competition here. But since you asked, the users messaging me about this problem usually stated switching to beam.cloud.

Jason•6/12/25, 11:56 PM

What problem did you experience?

Jason•6/13/25, 12:27 AM

I think it can be fast.
For 16 GB GPUs:
with fast boot it can drop from 6.5s to about 1.1s, and when a request comes in while the worker is still running (within the idle timeout) it drops to 0.1s

Jason•6/13/25, 12:29 AM

Maybe 1.1s is normal for 16 GB GPUs, and around 0.9-1.0s is normal for 80 GB ones (from an idle worker)?

Jason•6/13/25, 12:32 AM

Side note: that is the example worker, which only returns a string combined with the input
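
For reference, a worker of that kind (no model, no initialization, just a string built from the input) can be as small as the sketch below. This is an assumed reconstruction rather than the exact worker used in these tests; the `name` input key is hypothetical.

```python
def handler(job):
    """Return a string combined with the input; no model and no setup work."""
    job_input = job.get("input") or {}
    name = job_input.get("name", "world")  # hypothetical input key
    return f"Hello, {name}!"


# With the RunPod Python SDK, a handler like this is normally registered with:
#   runpod.serverless.start({"handler": handler})
```

Because such a handler runs in well under a millisecond, any delay reported for it reflects container start and scheduling rather than user code.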

3WaDOP•6/13/25, 8:44 AM

So it's behaving the same for you. If the delay were around 1 second for all requests, it would be fine. The biggest problem is the first one. As I said, the workers on popular GPUs shift often, so with non-batch usage you can experience this on most of the requests.