vLLM - How to avoid downloading weights every time?

I have a Serverless Endpoint with vLLM, using this Docker image: runpod/worker-v1-vllm:v2.7.0stable-cuda12.1.0. My env vars:
MODEL_NAME=Qwen/Qwen2.5-VL-3B-Instruct-AWQ
DOWNLOAD_DIR=/runpod-volume
DTYPE=float16
GPU_MEMORY_UTILIZATION=0.90
ENABLE_PREFIX_CACHING=0
QUANTIZATION=awq_marlin
LIMIT_MM_PER_PROMPT=image=1,video=0
MAX_MODEL_LEN=16384
ENFORCE_EAGER=true
TRUST_REMOTE_CODE=true
VLLM_IMAGE_FETCH_TIMEOUT=10
HF_HOME=/runpod-volume/huggingface-cache
TRANSFORMERS_CACHE=/runpod-volume/huggingface-cache
In the worker logs I have:
2025-08-03T09:30:56.128364829Z INFO 08-03 09:30:56 [model_runner.py:1171] Starting to load model Qwen/Qwen2.5-VL-3B-Instruct-AWQ...
2025-08-03T09:30:56.371946022Z WARNING 08-03 09:30:56 [vision.py:91] Current `vllm-flash-attn` has a bug inside vision module, so we use xformers backend instead. You can run `pip install flash-attn` to use flash-attention backend.
2025-08-03T09:30:56.833587201Z INFO 08-03 09:30:56 [weight_utils.py:292] Using model weights format ['*.safetensors']
2025-08-03T09:31:08.946316697Z INFO 08-03 09:31:08 [weight_utils.py:308] Time spent downloading weights for Qwen/Qwen2.5-VL-3B-Instruct-AWQ: 12.112183 seconds
2025-08-03T09:31:09.112774049Z INFO 08-03 09:31:09 [weight_utils.py:345] No model.safetensors.index.json found in remote.
2025-08-03T09:31:09.114026816Z
Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
2025-08-03T09:31:11.829100920Z
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:02<00:00, 2.71s/it]
2025-08-03T09:31:11.829134000Z
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:02<00:00, 2.72s/it]
2025-08-03T09:31:11.943191906Z INFO 08-03 09:31:11 [default_loader.py:272] Loading weights took 2.83 seconds
2025-08-03T09:31:12.506203683Z INFO 08-03 09:31:12 [model_runner.py:1203] Model loading took 3.3186 GiB and 15.732077 seconds
J.
J.4w ago
You've got to bake it into the image. The GitHub repo has instructions (you can also point ChatGPT at it for further detail). Unfortunately that's a general problem with this kind of setup. https://github.com/runpod-workers/worker-vllm?tab=readme-ov-file#option-2-build-docker-image-with-model-inside just the model name is probably all you need.
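For reference, the Option 2 flow in that README boils down to cloning the worker repo and building the image with the model passed as a build argument, so the weights are downloaded once at build time and baked into the image layers. A rough sketch (the exact build-arg names may differ, so double-check the repo's Dockerfile; the image tag is just a placeholder):

# clone the worker repo
git clone https://github.com/runpod-workers/worker-vllm.git
cd worker-vllm
# build with the model baked in (MODEL_NAME build-arg per the README; tag is a placeholder)
docker build -t yourdockerhubuser/worker-vllm-qwen25vl-awq:1.0 \
  --build-arg MODEL_NAME="Qwen/Qwen2.5-VL-3B-Instruct-AWQ" .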
takezo07
takezo07OP4w ago
You mean I have to use "Option 2: Build Docker Image with Model Inside"? There's no way to use "Option 1: Deploy Any Model Using Pre-Built Docker Image"?
J.
J.4w ago
Not to my knowledge. I have the same issue, so I've been meaning to build a library of popular models for myself, but yeah 😔. The first option uses the base Docker image and only sets the env variables, which means every time serverless launches, the base image starts in a default state -> reads the env variables -> downloads the model -> runs. So if you want to skip the download step, the model has to already be in the image.
takezo07
takezo07OP4w ago
OK, thanks... that's too bad, and quite a bit more complicated than I hoped. I don't fully understand, though. So I have to build my own Docker image? Do I first have to download the weights and put them into the image myself? If so, what's the point of having storage? And if I want to keep the weights in storage instead, how do I do that? Isn't it possible to download the weights on the first start and then never need to download them again?
J.
J.4w ago
What are you trying to do? What's the end goal?
takezo07
takezo07OP4w ago
I want a vLLM Serverless Endpoint with a quick cold start.
J.
J.4w ago
Nah, basically you just pass the argument to the build command when you build the Dockerfile. As Docker builds the image and pushes it to the registry, the image will have everything inside. Whenever you build a Dockerfile, it basically walks through all your instructions, creates a snapshot, and saves that snapshot in the registry. So if you run something like "docker build my-model using this Dockerfile", it downloads the model and builds itself; you don't need to download the weights and put them in manually (though I guess that is effectively what happens during the build). As for downloading only the first time, I'm not sure whether this vLLM worker supports network storage for that; I'm actually new here and haven't touched it much. I was thinking about network storage too, but I don't see it described in the README. Storage is only useful for things that actually make use of it, and not all repositories use network storage, though it does give you persistence. Network volumes can also be slow for I/O.
takezo07
takezo07OP4w ago
OK, but since we can set environment variables, I thought it was possible. I'm going to look into writing my own Dockerfile, but frankly I'm not sure I'll manage it. I know how to build an image, push it to Docker Hub, and then deploy it on RunPod, but I'm not sure I can write the Dockerfile itself from scratch. The whole point of using vLLM was to keep things simple.
J.
J.4w ago
The Dockerfile is already there, so you can just run the existing build command with your arguments; you can ask ChatGPT, but you shouldn't have to write the Dockerfile yourself. It should just be: download the repo, cd into it, install Docker if you haven't already, and do
takezo07
takezo07OP4w ago
GitHub: worker-vllm/Dockerfile at main · runpod-workers/worker-vllm
https://github.com/runpod-workers/worker-vllm/blob/main/Dockerfile
J.
J.4w ago
docker build (your model) with the Dockerfile that's already there, then push, yeah.
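Continuing the sketch above, once the build finishes you push the image to your own registry and use that tag on the endpoint instead of the stock image (registry name and tag are placeholders):

# push the image you just built to Docker Hub (or any registry RunPod can pull from)
docker login
docker push yourdockerhubuser/worker-vllm-qwen25vl-awq:1.0
# then set the endpoint's container image to that tag instead of runpod/worker-v1-vllm:v2.7.0stable-cuda12.1.0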
J.
J.4w ago
[screenshot attachment, no description]
J.
J.4w ago
It has examples. I just throw the URL into ChatGPT and ask it which command to use to build and push these things, and I give it my HF model URL whenever I do it (I don't use vLLM myself, I use other stuff I've made). 🥱 Just FYI, it's 4 am for me, so if I disconnect I've probably passed out, but hopefully this helps.
takezo07
takezo07OP4w ago
OK, I'm starting to understand. I'll still have to do a lot of research, because I don't understand how to download the weights locally so that they end up in my Docker image. Thanks for your help.
J.
J.4w ago
You don't download anything locally; you just pass the arguments, and the download happens while the Dockerfile builds. It's like if I made a Dockerfile with an ARG variable ("give me a prompt") and an instruction that echoes that variable: you don't need to supply the "prompt" yourself ahead of time, you just need to run
takezo07
takezo07OP4w ago
Yes, but will there be a download at each cold start?
J.
J.4w ago
build image (variable). Ah, no. Because once the build command is done
takezo07
takezo07OP4w ago
ok
J.
J.4w ago
it will actually have built a snapshot and pushed it to Docker.
takezo07
takezo07OP4w ago
i get it
J.
J.4w ago
And then you use that snapshot: instead of using the RunPod image, you use myname/vllm:1.0 (or wherever you pushed it), basically creating your own snapshot. I would test with a small model first to make sure things work and not waste too much time building.
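If you want to sanity-check the whole flow cheaply before the real build, the same command pattern works with a small model first (the model below is just an example, any small HF model will do):

# quick test build to validate the build -> push -> deploy loop
docker build -t yourdockerhubuser/worker-vllm-test:0.1 \
  --build-arg MODEL_NAME="Qwen/Qwen2.5-0.5B-Instruct" .
docker push yourdockerhubuser/worker-vllm-test:0.1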
takezo07
takezo07OP4w ago
It's still weird, because with the configuration I gave above, I see that the weights are still downloaded to my storage.
PRE .locks/
PRE huggingface-cache/
PRE models--Qwen--Qwen2.5-VL-3B-Instruct-AWQ/
2025-08-03 11:24:27 0 a4f8ee14169c384a82714c91fbd37cf51eb65e05a5312641e27d7252ee813405Qwen-Qwen2.5-VL-3B-Instruct-AWQ.lock
It's just that on the next startup, it isn't able to reuse the weights that were already downloaded.
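One thing that could be tried here, as an alternative to baking the weights into the image (treat this as an assumption, since it depends on the worker passing MODEL_NAME straight through to vLLM, which does accept a local directory path): download the weights onto the network volume once, for example from a pod with the volume attached, and then point MODEL_NAME at that path so nothing is fetched from the Hub at cold start. A rough sketch, with a hypothetical target directory:

# one-time download onto the network volume (run from a pod/worker with the volume mounted)
pip install -U "huggingface_hub[cli]"
huggingface-cli download Qwen/Qwen2.5-VL-3B-Instruct-AWQ \
  --local-dir /runpod-volume/models/Qwen2.5-VL-3B-Instruct-AWQ
# then, on the endpoint, point the worker at the local copy instead of the repo id
# (assuming the worker accepts a path here): MODEL_NAME=/runpod-volume/models/Qwen2.5-VL-3B-Instruct-AWQ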
