Runpod•11mo ago

Need help in fixing long running deployments in serverless vLLM

Hi, I am trying to deploy migtissera/Tess-3-Mistral-Large-2-123B model on serverless using 8 48GB GPUs using vLLM.

The total size of model weights around 245 GB.

I have tried two ways: 1st way without any network volume, it takes really long time to serve first request as it needs to download the weights. Then if the worker goes to idle and I sent a request again it downloads weights and takes long time.

2nd way: I tried to use a 300GB network storage but it is usually gets stuck at half of model weights downloading and then the worker gets killed.

I am losing money fast because of this. Please help.
I have attached all the screenshots.

zethosOP•1/30/25, 5:10 AM

@nerdylive Could you please help? I tried loading the model to network volume using a pod and then attach the network volume to the serverless instaance. Still, its taking time to load.

Jason•1/30/25, 5:50 AM

1 seems normal, 2nd way because of execution timeout? In your endpoint?

Zzethos @nerdylive Could you please help? I tried loading the model to network volume us...

Jason•1/30/25, 5:50 AM

Maybe if it's too slow try downloading when you make the image

Jason•1/30/25, 5:51 AM

Docker image for serverless template

JJason 1 seems normal, 2nd way because of execution timeout? In your endpoint?

zethosOP•1/30/25, 6:04 AM

yes. I increased the execution timeout and it works for some time. And then when worker goes idle, again it needs to load the models and a 15-20 misn wait.

Jason•1/30/25, 6:05 AM

yes thats normal, you will wanna see if this loads your model faster

JJason Maybe if it's too slow try downloading when you make the image

Jason•1/30/25, 6:05 AM

this way, download via a command in your dockerfile, so its built in your image

JJason Docker image for serverless template

zethosOP•1/30/25, 6:07 AM

Yes, this is what I am going to do. The model has around 51 files with total siz around 240 GB.
I am thinking of building the docker image with the whole 245 GB of files inside using the Option 2 montioned here: https://github.com/runpod-workers/worker-vllm?tab=readme-ov-file#option-2-build-docker-image-with-model-inside

Do you think it will be too much 245 GB + take around 20 GB for Ubuntu and cuda drivers.

Jason•1/30/25, 6:08 AM

hmm yeah im not sure what will be too big, but aslong it works on your registry and runpod you can try it

Jason•1/30/25, 6:09 AM

i havent tried building a docker image with model that size

JJason hmm yeah im not sure what will be too big, but aslong it works on your registry ...

zethosOP•1/30/25, 6:10 AM

yeah, previously I tried with many whisper and Bert based models in a single docker image itself and it worked.
May be beacuse of docker image size is small. Hardly 20 GB max.

JJason hmm yeah im not sure what will be too big, but aslong it works on your registry ...

zethosOP•1/30/25, 6:10 AM

yeah, thanks.

JJason Docker image for serverless template

zethosOP•1/30/25, 6:11 AM

Can you tell me the exact template? Or, you were refering to the vLLM template?

Jason•1/30/25, 6:28 AM

you make your own serverless template, image

Zzethos Yes, this is what I am going to do. The model has around 51 files with total siz...

Jason•1/30/25, 6:28 AM

follow the guide in the vllm there?

zethosOP•1/30/25, 6:35 AM

yeah, also dockerhub has size limit of 100GB, so I cannot put modelfiles inside docker and upload to dockerhub

Jason•1/30/25, 6:46 AM

Need help in fixing long running deployments in serverless vLLM

Similar Threads

Similar Threads

Similar Threads