Runpod•2w ago
3WaD

Queue Delay Time

What is currently a normal delay time? I remember that previously it was normal to have delay times in milliseconds, and container startups were near instant even on the coldest of cold starts. But lately I have been observing queue delays I don't recognize, to the point where my vLLM image can fully initialize the engine on a cold start and compute a full response in almost the same time RunPod takes just to start the container. The same goes for SDXL, although it's a bit better there.

Does container image size affect it? But then why would it also happen on warm requests? This makes it practically unusable. Creating a new endpoint or downgrading the RunPod SDK version didn't help, and overall there's nothing I could find that would let me influence this further. Plus, as I said, it's not happening only on cold starts: the delay on a warm worker is even more extreme, as it's sometimes longer than the execution time itself. (See screenshots.)

*Please note that the delay time in this case is truly only the queue delay. I had to move the initialization (loading models etc.) into the handler, where it's counted as execution, because I wanted to allow users to change the vLLM configuration per cold start via the request payload. Any help would be appreciated, since this is a major deal breaker.
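For reference, my handler follows roughly this pattern (a simplified sketch, not my exact code; the model path and defaults are illustrative): the vLLM engine is built lazily inside the handler, so the first request's payload can override the engine arguments.

```python
# Sketch: vLLM engine built inside the handler (not at import time),
# so its configuration comes from the cold-start request payload.
import runpod
from vllm import LLM, SamplingParams

ENGINE = None  # created lazily on the first (cold-start) request


def get_engine(cfg: dict):
    global ENGINE
    if ENGINE is None:
        # Engine args come from the payload, with fallbacks to defaults.
        ENGINE = LLM(
            model=cfg.get("model", "/models/my-model"),  # placeholder baked-in path
            max_model_len=cfg.get("max_model_len", 8192),
            gpu_memory_utilization=cfg.get("gpu_memory_utilization", 0.95),
        )
    return ENGINE


def handler(job):
    inp = job["input"]
    # Initialization happens here, so it's counted as execution time, not delay.
    llm = get_engine(inp.get("engine_config", {}))
    params = SamplingParams(max_tokens=inp.get("max_tokens", 256))
    outputs = llm.generate([inp["prompt"]], params)
    return {"text": outputs[0].outputs[0].text}


runpod.serverless.start({"handler": handler})
```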
19 Replies
3WaDOP•2w ago
vLLM coldstarts:
(screenshots attached)
3WaDOP•2w ago
SDXL (Fooocus) coldstarts:
(screenshots attached)
3WaDOP•2w ago
vLLM warm request:
(screenshot attached)
Jason•2w ago
Which scaling type did you use on the endpoint? Do both result in the same delay?
3WaDOP•2w ago
I normally use Request Count set to 1, as recommended in the docs, but it happens with the Queue Delay scaling type too.
Jason•2w ago
Do you want to open a support ticket then? And how big is the container image we're talking about?
3WaDOP•2w ago
What queue delay times do you observe on your endpoints? I'm trying to understand whether this is normal or whether I'm doing something wrong. I'm using roughly 20 GB images with the models baked in. I thought that might affect this because of loading from disk, but then why would it happen on warm requests too?
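For context, the weights get baked in at build time with a step roughly like this (an illustrative sketch; the repo ID and target path are placeholders):

```python
# Run during `docker build` so the weights land in an image layer
# (this is what pushes the image to ~20 GB).
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="my-org/my-model",     # placeholder model repo
    local_dir="/models/my-model",  # placeholder path baked into the image
)
```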
Jason•2w ago
Hmm, so that's the total image size? Maybe it's best to create a ticket so staff can check your endpoint.
3WaDOP•2w ago
Maybe. But as I said, it's happening on fresh new endpoints too, so perhaps they would have to check the whole account 😀
Jason•2w ago
Yeah, sure, maybe they can find something.
3WaDOP•2w ago
I will try storing the models in a network volume today to see if there's a change. But it would really help to know if others observe this too.
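The plan is something along these lines, so the handler prefers the volume copy when one is attached (a sketch only; I'm assuming the usual /runpod-volume mount point, and both paths are placeholders):

```python
import os

# Prefer weights on the attached network volume, fall back to the copy
# baked into the image. Mount point and paths are assumptions.
VOLUME_DIR = "/runpod-volume/models/my-model"
IMAGE_DIR = "/models/my-model"


def resolve_model_dir() -> str:
    return VOLUME_DIR if os.path.isdir(VOLUME_DIR) else IMAGE_DIR
```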
Jason•4d ago
Which Docker image did you use for the base, btw? I don't really use serverless these days, so I couldn't compare on production endpoints. Have you created any tickets regarding this issue?
Poddy•4d ago
@3WaD
Escalated To Zendesk
The thread has been escalated to Zendesk!
Ticket ID: #18647
3WaDOP•3d ago
Not yet. I first wanted to know if other users have this problem too.
Jason•3d ago
I think it's better to report it first, so that they know this is a problem you're experiencing.
River•3d ago
So the answer here is that we do caching on a best-effort basis, and if there is a high amount of demand from other customers, unfortunately we have to clear the cache more often, which causes higher delays.
3WaDOP•3d ago
So the delay is caused at the job-scheduling level in your system, and the worker has to be cached in order to be fast? There's nothing I can do on my side to reduce it? How many requests does it usually take to cache it? Because, as stated, it's also happening on warm-worker requests, right after the previous ones.
River•3d ago
The general rule is that smaller models load faster, and it isn't about the number of requests but the delay between them. Can I ask how much delay there is between requests on your end?
3WaDOP•2d ago
The usage is inconsistent and not super frequent, especially during development. That's why we use serverless in the first place, right?
@River To make it clear: based on the email you sent me, these replies, and the RunPod AI (good thing you have at least that), the recommendation is just the standard advice, to pick different GPUs and use them more frequently? Shouldn't there be something like "delay-based priority GPU selection" that initializes workers on the GPUs with the highest availability and lowest scheduling time on the endpoint? Why are we supposed to watch availability (which we can't even see) and manage this ourselves on a serverless platform? Other customers' traffic on a specific GPU you picked making your product unavailable doesn't sound very production-ready.
