Veryyyyyy slow serverless VLLM
Considering moving away from RunPod. It's just insane how slow this is on serverless.
RunPod serverless 4090 GPU, cold start of vLLM:
Model loading took 7.5552 GiB and 52.588290 seconds
My local 3090, cold start of vLLM:
Model loading took 7.5552 GiB and 1.300690 seconds
Any ideas?
This happens while using the exact same image on both machines?
Depending on your image, the cold start also includes the time to download and actually start the image itself.
It's a little rough for now, and we're working on a solution for that, but eventually you'll have the image cached on all the servers your serverless jobs can be picked up by (filtered by CUDA version & allowed GPUs).
The downloading of the container image is the worker initialization stage. He's talking about the initialization of the vLLM engine and the difference in how long it takes to load the model into VRAM on RunPod vs. his local machine.
Oh the actual vllm engine.
By default, RunPod provides users with quite an old vLLM version (and I should probably bump this). Do you see the same delay using this container instead?
runpod/worker-v1-vllm:v2.5.0stable-cuda12.1.0
This image should contain, IIRC, vLLM 0.8.5 instead of the 0.6.x version we provide by default. That's why I am asking if he uses the exact same image on RunPod and on his local machine. He could be using a custom one, or different versions and configs. There's too little information provided.
Same vLLM on both machines; RunPod is using
runpod/worker-v1-vllm:v2.4.0stable-cuda12.1.0
I'll test with runpod/worker-v1-vllm:v2.5.0stable-cuda12.1.0 and report back.
Using version 2.5.0, the loading time is as follows: Model loading took 7.5552 GiB and 51.169156 seconds. Still unusable. I'm also getting very inconsistent cold starts (pod load time). I think with it being this unstable, we have to stop using serverless. I don't remember it ever being this bad.

So, you're using the official image set via the web UI? The model is baked into the image, so you're not using a network volume, correct? Does this happen in multiple data centre locations? Did you try a different one? I know from experience that there might be a slight difference in performance between them.
Also, can you please share a log of such a long cold start so we can confirm the problem?
I’ve tried several regions, and I also tried using the model inside the image by building and running the Docker image, as well as from the UI, and I get the same results.
I think we can try with the Qwen3 4B model and see if we get the same cold start, all with default settings.
Can you please share the logs when you get that long cold start?
You can copy it from the Workers tab by clicking on the currently running worker.
The log you've sent shows 42s for initialization. That's quite a normal time for the default image.
The longest part there is 14s for CUDA graph capture (see "Capturing cudagraphs for decoding"), which you can disable by using eager mode, but that decreases the inference speed of the model quite drastically. Instead, I recommend storing the graphs and cache on a shared network volume so you compute them only once and other cold workers with the same setup can reuse them (see the sketch at the end of this message). By disabling the guided-decoding-backend, you should also be able to use the V1 engine, which uses torch.compile to cache even more things and speed up inference.
The second longest is 11s for the engine init itself (see engine.py), which we can't do much about without going into the code.
The third is an unnecessary 8 seconds to download the model (see "Time spent downloading weights"). You can fix this by baking the files into the image and properly setting a local path to them (also sketched below).
By fixing just these basic things, you can cut the cold start time by ~50% (to around 20 seconds) without going into deeper optimizations. That's probably all I can say about this log; you can try sending the 2 or 3-minute ones if you catch them. The thing is, by default, both vLLM and its RunPod image are not optimized for production use with maximum performance (and they can't even be, for every model out there). You have to understand the settings, and dialling them in right can make an enormous difference.
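To make the first and third points concrete, here's roughly what the worker's startup code could do. Treat it as a sketch: the volume mount path, cache env var names and model path are assumptions you'd adapt to your own image and vLLM version.

import os

# Assumption: the network volume is mounted at /runpod-volume and the weights
# were baked into the image at /models/qwen3-4b during the docker build.
CACHE_BASE = "/runpod-volume/caches"
os.environ.setdefault("VLLM_CACHE_ROOT", f"{CACHE_BASE}/vllm")              # vLLM's own cache (compile artifacts etc.)
os.environ.setdefault("TORCHINDUCTOR_CACHE_DIR", f"{CACHE_BASE}/inductor")  # torch.compile / Inductor cache
os.environ.setdefault("TRITON_CACHE_DIR", f"{CACHE_BASE}/triton")           # Triton kernel cache

from vllm import LLM  # import only after the env vars are set

# A local directory instead of a hub ID means nothing is downloaded at cold start.
llm = LLM(model="/models/qwen3-4b")

The first cold worker still pays the compile cost; every later one with the same GPU and settings should be able to reuse what landed on the volume.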
In my own images, I also implemented a "prewarming method". This lets you send prewarm requests to keep the workers ready continuously (or on demand), without cold starts. RunPod is testing a similar feature, but as far as I know, it's in closed beta.
How did it work out for you?
You have to put the model in the container image; any other way, like network storage, will be much slower.
It does. Sometimes you might have to spam it a bit, depending on how many warm workers you have and which autoscale type is selected. For my use case of a small Discord server AI bot, it's currently more than enough to have a simple "warm timer" and to send a couple of prewarm requests the moment a user starts typing in the bot room, if the timer has expired. A more robust solution would be to fetch the endpoint state via the RunPod API periodically, keep a record of the worker IDs with their readiness, and send prewarms based on that. But I am looking forward to the release of the official solution; that will definitely be the cleanest one.
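A minimal sketch of that polling approach, assuming the standard /health and /run endpoints; the response fields and the prewarm payload are assumptions your own handler would need to understand.

import time
import requests

API_KEY = "..."      # RunPod API key
ENDPOINT_ID = "..."  # serverless endpoint ID
BASE = f"https://api.runpod.ai/v2/{ENDPOINT_ID}"
HEADERS = {"Authorization": f"Bearer {API_KEY}"}

def idle_workers() -> int:
    # Assumption: /health reports per-state worker counts under "workers".
    health = requests.get(f"{BASE}/health", headers=HEADERS, timeout=10).json()
    return health.get("workers", {}).get("idle", 0)

def prewarm() -> None:
    # Fire-and-forget job whose only purpose is to bring a worker up.
    # Your handler has to recognise {"prewarm": true} and return right after the engine is ready.
    requests.post(f"{BASE}/run", headers=HEADERS, json={"input": {"prewarm": True}}, timeout=10)

while True:
    if idle_workers() == 0:
        prewarm()
    time.sleep(30)  # poll interval; tune it to your traffic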
This is not a solution that truly addresses the problem. We tried exactly that, and since we have very inconsistent warm-ups, even if the model is already in the image, we sometimes experience delays of up to one minute for the "warm-up" to start when a user begins typing or connects to the channel. If we make a second call and it turns out the pod was already "warm", it ends up launching a second instance. The issue is that it takes anywhere from 1 to 2 minutes to start with the image, or even up to 3 minutes, with no guarantees. We tried a few regions, but the problem only got worse. Is there any real solution to the cold start inconsistency?
I'm afraid I can't help you more without seeing the 2-3-minute cold start log.
I've noticed that when it takes too long to start, this "throttled" message appears.

How do I avoid this?
This would mean you have no idle workers available; it's not an issue with their cold-start time. Your jobs will stay in the queue if that happens. You should select GPUs that are highly available in your region, or check in which regions your desired GPU type is available. You can also prioritize multiple types. "Throttled" means the worker is overloaded, possibly with workloads from other users. See https://docs.runpod.io/serverless/endpoints/endpoint-configurations
You should raise the max workers setting on the endpoint.
If this is the case, then you're using limited GPUs. L4s and A5000s are definitely more limited in certain regions; for 24GB the best one to use is the 4090, since it has the most availability, or find a region with more A5000s. Why do you need a region selected if the model is in the container image? Do you have any other limitations?
"Throttled" simply means the server the worker is cached on is fully utilized.
Increasing max workers does spread your endpoint's footprint onto much wider hardware. Also, not selecting a specific region allows your endpoint to span many regions and load-balance automatically.
bumping this
INFO 05-17 20:09:56 [loader.py:458] Loading weights took 113.32 seconds
Fine-tuned Gemma 3 12B model deployed via the web UI (vLLM 0.8.5) with network storage attached. Is this a disk I/O issue?
That only matters in the real-world scenario where there are high request volumes; I don't see the point of capping max workers just to work around a RunPod issue. The problem is that, regardless of the region, it happens at different times (as shown in the image). We've tested different GPUs and regions, and the problem always occurs.
The suggestion of keeping the service "warm" for when someone might make a call isn't really a solution. If the pod freezes and incurs a 30-second cold start on that "warming" call, the second call (the user's real request) will spin up a new pod because of the delay. Unfortunately, we don't see how to solve this problem: there's no consistent cold-start time; it's always different and hard to manage. Any real solution?
I'm receiving private messages asking how I solved this. I believe it's an issue that's either not well described in the documentation or something that needs attention from Runpod, because it seems I'm not the only one experiencing it.
Active workers, dedicated pod/server or even investing in your own hardware. Or, if you don't have enough traffic to justify that, find a managed service that hosts the model you like with token-based billing. https://openrouter.ai/models
Serverless is like this. It's dynamic, has its limits, and wasn't really designed for huge workloads like AI in the first place. RunPod is one of the faster GPU platforms and keeps things mostly in a usable range (trust me, I've tried others and it can be way worse).
I've been using RunPod for quite some time, since before the vLLM image could be deployed from the UI, back when cold start times were consistent and easy to manage. If that's no longer the case and things are as you suggest, then fine, it's time to move on and use another platform. It's a shame, because it had worked well for us until recently.
Additionally, it's sad because we've been trying to find a solution to something that didn't use to happen. I came to ask for support just like I did in the past; I once had a UTF-8 issue and they helped me until it was resolved. The last thing I wanted was to start a debate about whether RunPod is fast and others are slow; I know that, because I've used RunPod for quite a while. I appreciate the comments. I'll pass along the message so those messaging me privately know this wasn't designed for that purpose. Thanks, we can close this topic.
Yes, I observe more frequent worker shifting too. Perhaps it's because more and more people are using RunPod? It does push me to optimize the startup times, and thus to understand the things I'm implementing more deeply, but I get that not everyone wants to write their own, or even think about it. That being said, with my RunPod-FooocusAPI repo we usually had around 12s cold starts vs 6s warm T2I executions, and with times like that you don't care that much about it. That's why I'm currently looking around for something other than vLLM. While it's very fast, especially for batched inference, and the best for multi-GPU, that comes at the price of being possibly the slowest-initializing LLM framework out there, which doesn't really make it a good pick for such serverless applications.
Hi!
Using a network volume to store the .cache folder is tempting; it could speed up vLLM startup once most of the hardware configurations have been cycled through.
But a network volume can only be shared within a single location, so it limits your choices.
I believe those vLLM/torch caches are quite small: do you know if they could be synced from a network location other than a shared volume? Doing this asynchronously at endpoint startup, and then again once it has warmed up, could work…
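Something along these lines, maybe; purely a sketch, and the bucket, tool and cache path are assumptions.

import subprocess

CACHE_DIR = "/root/.cache/vllm"       # default vLLM cache location (may differ per version)
REMOTE = "s3://my-bucket/vllm-cache"  # any object storage reachable from every region

# Pull whatever cache already exists before the engine starts compiling...
subprocess.run(["aws", "s3", "sync", REMOTE, CACHE_DIR, "--quiet"], check=False)

from vllm import LLM
llm = LLM(model="/models/qwen3-4b")   # engine init reuses the downloaded artifacts where it can

# ...and push the (possibly updated) cache back in the background once warm,
# so workers in other regions that can't share the volume still benefit.
subprocess.Popen(["aws", "s3", "sync", CACHE_DIR, REMOTE, "--quiet"])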
The trick for fast cold starts is actually disabling all these optimizations. With https://github.com/davefojtik/RunPod-vLLM I was able to achieve sub-10-second cold starts by disabling Graph Capture, torch.compile and rewriting memory profiling in vLLM. It slows the throughput a lot, but you can't have fast startup and inference at the same time. Storing and loading cache and graphs from a shared network volume allows you to balance it a bit more, but the cold start still takes 30+ seconds in that case.
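On the engine side, that profile looks roughly like this. It's a sketch: flag availability varies between vLLM versions, and the block count is a placeholder you'd measure once for your GPU/model combo.

from vllm import LLM

# Trade throughput for startup time: no CUDA graph capture, and a pre-measured
# KV cache size so the engine doesn't have to profile memory on every cold start
# (actually skipping that step needs the memory-profiling rewrite mentioned above).
llm = LLM(
    model="/models/qwen3-4b",        # baked-in weights
    enforce_eager=True,              # skip CUDA graph capture
    num_gpu_blocks_override=12000,   # illustrative value, measure once per GPU/model
    gpu_memory_utilization=0.95,
)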
Have you considered opening a PR for the memory profiling change? I'd love to use some of the improvements here in the official RunPod template I'm working on updating.
I was actually thinking about making a PR to vLLM itself, but I haven't gotten to it yet. Plus it's for V0, and they're pushing V1 hard now. But sure, let's see the just-released 0.9.0-stable and I can look into it. Also, feel free to use anything; that's why we do this open-source, right?
Which file did you modify for the memory profiling?
Maybe, if it's adaptable enough, I can help some.
It's a very simple monkey patch of vllm/engine/llm_engine.py that results in calling model_executor.determine_num_available_blocks() only if num_gpu_blocks_override is not specified.
Oh, how does that affect the LLM profiling, the blocks?
You can check just once how many free KV cache blocks the GPU you're using has, and then specify that as a static value. It also skips the dummy-request model pre-warm, which is not something we want on serverless. It's ideal if you're using a predictable, single-GPU-size endpoint.
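In spirit it's something like this; a simplified sketch against the V0 engine, and the internal names move around between releases, so verify it against the version you ship.

from vllm.engine.llm_engine import LLMEngine

_original_init_kv = LLMEngine._initialize_kv_caches

def _init_kv_caches_without_profiling(self) -> None:
    override = self.cache_config.num_gpu_blocks_override
    if override is None:
        # No static value supplied: keep the normal behaviour (profiling run).
        _original_init_kv(self)
        return
    # Static value supplied: skip model_executor.determine_num_available_blocks(),
    # i.e. the dummy-request memory profiling, and initialise the cache directly.
    num_cpu_blocks = 0  # sketch value: no CPU swap blocks; compute properly if you use swap_space
    self.cache_config.num_gpu_blocks = override
    self.cache_config.num_cpu_blocks = num_cpu_blocks
    self.model_executor.initialize_cache(override, num_cpu_blocks)

LLMEngine._initialize_kv_caches = _init_kv_caches_without_profiling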
I'm working on optimizing our base images as well as I can without killing the user experience. I agree it's not ideal, and a control to opt out of it would help every RunPod user using our vLLM image.