vLLM - How to avoid downloading weights every time?

I have a Serverless Endpoint with vLLM, using this Docker image: runpod/worker-v1-vllm:v2.7.0stable-cuda12.1.0. My env vars:
MODEL_NAME=Qwen/Qwen2.5-VL-3B-Instruct-AWQ
DOWNLOAD_DIR=/runpod-volume
DTYPE=float16
GPU_MEMORY_UTILIZATION=0.90
ENABLE_PREFIX_CACHING=0
QUANTIZATION=awq_marlin
LIMIT_MM_PER_PROMPT=image=1,video=0
MAX_MODEL_LEN=16384
ENFORCE_EAGER=true
TRUST_REMOTE_CODE=true
VLLM_IMAGE_FETCH_TIMEOUT=10
HF_HOME=/runpod-volume/huggingface-cache
TRANSFORMERS_CACHE=/runpod-volume/huggingface-cache
In the worker logs I have:
2025-08-03T09:30:56.128364829Z INFO 08-03 09:30:56 [model_runner.py:1171] Starting to load model Qwen/Qwen2.5-VL-3B-Instruct-AWQ...
2025-08-03T09:30:56.371946022Z WARNING 08-03 09:30:56 [vision.py:91] Current `vllm-flash-attn` has a bug inside vision module, so we use xformers backend instead. You can run `pip install flash-attn` to use flash-attention backend.
2025-08-03T09:30:56.833587201Z INFO 08-03 09:30:56 [weight_utils.py:292] Using model weights format ['*.safetensors']
2025-08-03T09:31:08.946316697Z INFO 08-03 09:31:08 [weight_utils.py:308] Time spent downloading weights for Qwen/Qwen2.5-VL-3B-Instruct-AWQ: 12.112183 seconds
2025-08-03T09:31:09.112774049Z INFO 08-03 09:31:09 [weight_utils.py:345] No model.safetensors.index.json found in remote.
2025-08-03T09:31:09.114026816Z
Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
2025-08-03T09:31:11.829100920Z
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:02<00:00, 2.71s/it]
2025-08-03T09:31:11.829134000Z
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:02<00:00, 2.72s/it]
2025-08-03T09:31:11.943191906Z INFO 08-03 09:31:11 [default_loader.py:272] Loading weights took 2.83 seconds
2025-08-03T09:31:12.506203683Z INFO 08-03 09:31:12 [model_runner.py:1203] Model loading took 3.3186 GiB and 15.732077 seconds
J.
J.4w ago
You've got to bake it into the image. The GitHub repo has instructions (you can also point ChatGPT at it for further detail). Unfortunately that's a general problem with this kind of setup. https://github.com/runpod-workers/worker-vllm?tab=readme-ov-file#option-2-build-docker-image-with-model-inside just the model name is probably all you need.
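For reference, the Option 2 flow in that README boils down to cloning the worker repo and building the image with the model passed as a build argument, so the weights are downloaded once at build time and baked into the image layers. A rough sketch (the exact build-arg names may differ, so double-check the repo's Dockerfile; the image tag is just a placeholder):

# clone the worker repo
git clone https://github.com/runpod-workers/worker-vllm.git
cd worker-vllm
# build with the model baked in (MODEL_NAME build-arg per the README; tag is a placeholder)
docker build -t yourdockerhubuser/worker-vllm-qwen25vl-awq:1.0 \
  --build-arg MODEL_NAME="Qwen/Qwen2.5-VL-3B-Instruct-AWQ" .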
takezo07
takezo07OP4w ago
You mean I have to use "Option 2: Build Docker Image with Model Inside"? There's no way to use "Option 1: Deploy Any Model Using Pre-Built Docker Image"?
J.
J.4w ago
Not to my knowledge. I have the same issue, so I've been meaning to build a library of popular models for myself, but yeah 😔. The first option uses the base Docker image and only sets the env variables, which means every time serverless launches, the base image starts in a default state -> reads the env variables -> downloads the model -> runs. So if you want to skip the download step, the model has to already be in the image.
takezo07
takezo07OP4w ago
OK, thanks... that's too bad, and quite a bit more complicated than I hoped. I don't fully understand, though. So I have to build my own Docker image? Do I first have to download the weights and put them into the image myself? If so, what's the point of having storage? And if I want to keep the weights in storage instead, how do I do that? Isn't it possible to download the weights on the first start and then never need to download them again?
J.
J.4w ago
What are you trying to do? What's the end goal?
takezo07
takezo07OP4w ago
I want a vLLM Serverless Endpoint with a quick cold start.
J.
J.4w ago
Nah, basically you just pass the argument to the build command when you build the Dockerfile. As Docker builds the image and pushes it to the registry, the image will have everything inside. Whenever you build a Dockerfile, it basically walks through all your instructions, creates a snapshot, and saves that snapshot in the registry. So if you run something like "docker build my-model using this Dockerfile", it downloads the model and builds itself; you don't need to download the weights and put them in manually (though I guess that is effectively what happens during the build). As for downloading only the first time, I'm not sure whether this vLLM worker supports network storage for that; I'm actually new here and haven't touched it much. I was thinking about network storage too, but I don't see it described in the README. Storage is only useful for things that actually make use of it, and not all repositories use network storage, though it does give you persistence. Network volumes can also be slow for I/O.
takezo07
takezo07OP4w ago
OK, but since we can set environment variables, I thought it was possible. I'm going to look into writing my own Dockerfile, but frankly I'm not sure I'll manage it. I know how to build an image, push it to Docker Hub, and then deploy it on RunPod, but I'm not sure I can write the Dockerfile itself from scratch. The whole point of using vLLM was to keep things simple.
J.
J.4w ago
The Dockerfile is already there, so you can just run the existing build command with your arguments; you can ask ChatGPT, but you shouldn't have to write the Dockerfile yourself. It should just be: download the repo, cd into it, install Docker if you haven't already, and do
takezo07
takezo07OP4w ago
GitHub: worker-vllm/Dockerfile at main · runpod-workers/worker-vllm
https://github.com/runpod-workers/worker-vllm/blob/main/Dockerfile
J.
J.4w ago
docker build (your model) with the Dockerfile that's already there, then push, yeah.
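Continuing the sketch above, once the build finishes you push the image to your own registry and use that tag on the endpoint instead of the stock image (registry name and tag are placeholders):

# push the image you just built to Docker Hub (or any registry RunPod can pull from)
docker login
docker push yourdockerhubuser/worker-vllm-qwen25vl-awq:1.0
# then set the endpoint's container image to that tag instead of runpod/worker-v1-vllm:v2.7.0stable-cuda12.1.0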
J.
J.4w ago
[screenshot attachment, no description]
J.
J.4w ago
It has examples. I just throw the URL into ChatGPT and ask it which command to use to build and push these things, and I give it my HF model URL whenever I do it (I don't use vLLM myself, I use other stuff I've made). 🥱 Just FYI, it's 4 am for me, so if I disconnect I've probably passed out, but hopefully this helps.
takezo07
takezo07OP4w ago
OK, I'm starting to understand. I'll still have to do a lot of research, because I don't understand how to download the weights locally so that they end up in my Docker image. Thanks for your help.
J.
J.4w ago
You don't download anything locally; you just pass the arguments, and the download happens while the Dockerfile builds. It's like if I made a Dockerfile with an ARG variable ("give me a prompt") and an instruction that echoes that variable: you don't need to supply the "prompt" yourself ahead of time, you just need to run
takezo07
takezo07OP4w ago
Yes, but will there be a download at each cold start?
J.
J.4w ago
build image (variable). Ah, no. Because once the build command is done
takezo07
takezo07OP4w ago
ok
J.
J.4w ago
it will actually have built a snapshot and pushed it to Docker.
takezo07
takezo07OP4w ago
i get it
J.
J.4w ago
And then you use that snapshot: instead of using the RunPod image, you use myname/vllm:1.0 (or wherever you pushed it), basically creating your own snapshot. I would test with a small model first to make sure things work and not waste too much time building.
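If you want to sanity-check the whole flow cheaply before the real build, the same command pattern works with a small model first (the model below is just an example, any small HF model will do):

# quick test build to validate the build -> push -> deploy loop
docker build -t yourdockerhubuser/worker-vllm-test:0.1 \
  --build-arg MODEL_NAME="Qwen/Qwen2.5-0.5B-Instruct" .
docker push yourdockerhubuser/worker-vllm-test:0.1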
takezo07
takezo07OP4w ago
It's still weird, because with the configuration I gave above, I see that the weights are still downloaded to my storage.
PRE .locks/
PRE huggingface-cache/
PRE models--Qwen--Qwen2.5-VL-3B-Instruct-AWQ/
2025-08-03 11:24:27 0 a4f8ee14169c384a82714c91fbd37cf51eb65e05a5312641e27d7252ee813405Qwen-Qwen2.5-VL-3B-Instruct-AWQ.lock
It's just that on the next startup, it isn't able to reuse the weights that were already downloaded.
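One thing that could be tried here, as an alternative to baking the weights into the image (treat this as an assumption, since it depends on the worker passing MODEL_NAME straight through to vLLM, which does accept a local directory path): download the weights onto the network volume once, for example from a pod with the volume attached, and then point MODEL_NAME at that path so nothing is fetched from the Hub at cold start. A rough sketch, with a hypothetical target directory:

# one-time download onto the network volume (run from a pod/worker with the volume mounted)
pip install -U "huggingface_hub[cli]"
huggingface-cli download Qwen/Qwen2.5-VL-3B-Instruct-AWQ \
  --local-dir /runpod-volume/models/Qwen2.5-VL-3B-Instruct-AWQ
# then, on the endpoint, point the worker at the local copy instead of the repo id
# (assuming the worker accepts a path here): MODEL_NAME=/runpod-volume/models/Qwen2.5-VL-3B-Instruct-AWQ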
