Queue Delay Time
What is a normal delay time these days? I remember that previously delay times were in the milliseconds and container startups were near instant even on the coldest of cold starts. Lately, though, I have been observing queue delays I don't recognize, to the point where my vLLM image can fully initialize the engine on a cold start and compute a full response in roughly the time RunPod takes just to start the container. The same goes for SDXL, although it's a bit better there. Does container image size affect this? But then why would it also happen on warm requests?
This makes it practically unusable. Creating a new endpoint or downgrading the RunPod SDK version didn't help, and I couldn't find anything else that would let me influence this. And as I said, it's not happening only on cold starts: the delay on a warm worker is even more extreme, as it's sometimes longer than the execution time itself. (See screenshots)
*Please note that the delay time in this case is truly only the queue delay; I had to move the initialization (loading models etc.) into the handler, where it's counted as execution time, because I want to allow users to change the vLLM configuration per cold start via the request payload.
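For context, the handler is set up roughly like this (a minimal sketch, not my exact code; the "engine_args" payload field and the default model name are just placeholders):

```python
# Minimal sketch: lazy vLLM engine initialization inside the handler, so the
# engine arguments can be changed per cold start via the request payload.
import runpod
from vllm import LLM, SamplingParams

llm = None  # created on the first job after a cold start

def handler(job):
    global llm
    inp = job["input"]
    if llm is None:
        # "engine_args" is an illustrative payload field, not a RunPod convention
        engine_args = dict(inp.get("engine_args", {}))
        model = engine_args.pop("model", "facebook/opt-125m")
        llm = LLM(model=model, **engine_args)
    params = SamplingParams(max_tokens=inp.get("max_tokens", 256))
    outputs = llm.generate(inp["prompt"], params)
    return {"text": outputs[0].outputs[0].text}

runpod.serverless.start({"handler": handler})
```

So everything before runpod.serverless.start runs at container start, and the model load only counts towards execution time, which is why the numbers above are pure queue delay.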
Any help would be appreciated since this is a major deal breaker.
vLLM cold starts: (screenshot)

SDXL (Fooocus) cold starts: (screenshot)

vLLM warm request: (screenshot)
Which scaling type did you use in the endpoint?
Both result in the same delay?
I normally use Request Count set to 1, as recommended in the docs, but it happens with Queue Delay scaling too.
Do you want to open a support ticket then?
Container size, how big are we talking about?
What queue delay times do you observe on your endpoints? I'm trying to understand whether this is normal or whether I'm doing something wrong. I'm using roughly 20 GB images with the models baked in. I thought that might matter because of loading from disk, but then why would it happen on warm requests too?
Hmm, so that's the total image size?
Maybe it's best to create a ticket for staff to check your endpoint.
Maybe. But as I said, it's happening on fresh new endpoints too, so perhaps they would have to check the whole account 😀
yeah sure, maybe they can find something
I will try storing the models in a network volume today to see if there's a change. But it would really help to know whether others observe this too.
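The idea is roughly this (a sketch, assuming the network volume is mounted at /runpod-volume, the usual serverless mount path; the paths and the snapshot_download call are just illustrative):

```python
# Sketch: keep model weights on the attached network volume instead of baking
# them into the image, so the container image itself stays small.
import os
from huggingface_hub import snapshot_download

VOLUME_DIR = "/runpod-volume/models"   # assumed serverless network-volume mount
MODEL_ID = "stabilityai/stable-diffusion-xl-base-1.0"  # placeholder model

def resolve_model_path(model_id: str) -> str:
    local_dir = os.path.join(VOLUME_DIR, model_id.replace("/", "--"))
    if not os.path.isdir(local_dir):
        # First run only: download onto the volume; later workers reuse it.
        snapshot_download(repo_id=model_id, local_dir=local_dir)
    return local_dir

model_path = resolve_model_path(MODEL_ID)  # pass this to the loader in the handler
```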
Which Docker image did you use for the base, btw?
I really don't use serverless these days btw, so I couldn't compare on production endpoints.
Have you created any tickets regarding this issue?
@3WaD
Escalated To Zendesk
The thread has been escalated to Zendesk!
Ticket ID: #18647
Not yet. I first wanted to know if other users have this problem too.
I think it's better to report it first so that they know this is a problem you're experiencing
So, the answer here is that we do caching on a best-effort basis, and if there is a high amount of demand from other customers, unfortunately we have to clear the cache more often
causing higher delays
So the delay is caused at the job-scheduling level in your system, and it has to be cached in order to be fast? Is there nothing I can do on my side to reduce it? How many requests does it usually take to cache it? Because, as stated, it's also happening on warm-worker requests, right after the previous ones.
The general rule is that smaller models load faster, and it isn't about the number of requests but the delay between the requests.
Can I ask how much delay there is between requests on your end?
The usage is inconsistent and not super frequent, especially during development. That's why we use serverless in the first place, right?
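If the gap between requests really is the deciding factor, the only client-side workaround I can think of is a periodic keep-warm ping, something like this (a sketch; the interval and the "warmup" flag are placeholders my handler would have to short-circuit on):

```python
# Sketch of a keep-warm pinger: send a tiny job at a fixed interval so the gap
# between requests never grows large enough for the worker/cache to go cold.
import os
import time
import requests

ENDPOINT_ID = os.environ["RUNPOD_ENDPOINT_ID"]
API_KEY = os.environ["RUNPOD_API_KEY"]
URL = f"https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync"

while True:
    r = requests.post(
        URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"input": {"warmup": True}},  # hypothetical no-op payload
        timeout=120,
    )
    body = r.json()
    print(r.status_code, body.get("delayTime"), body.get("executionTime"))
    time.sleep(240)  # keep the idle gap short (4 minutes here)
```

But that partly defeats the point of serverless and burns execution time, which is why I'd rather this be handled on the scheduling side.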
@River To make it clear, based on the email you've sent me, these replies, and the RunPod AI (good thing you at least have that), the recommendation is only the standard one: pick different GPUs and use them more frequently?
Shouldn't there be something like "delay-based priority GPU selection" that would initialize workers on the GPUs with the highest availability and lowest scheduling time on the endpoint, then? Why are we supposed to monitor availability (which we can't even see) and manage this ourselves on a serverless platform? Other customers' traffic on a specific GPU you picked making your product unavailable doesn't sound very production-ready.