Run LLM Model on Runpod Serverless

Hi there, I have an LLM model built into a Docker image, and the image is 40GB+. I'm wondering, can I mount the model as a volume instead of adding it to the Docker image? Thanks!
31 Replies
ashleyk
ashleyk4mo ago
Yes, you can put your model on network storage and load it from there, but it's generally more performant to bake the model into the Docker image because network storage is incredibly slow. Network storage also limits GPU availability.
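For illustration, a minimal sketch of a build-time download script that could be invoked from a Dockerfile RUN step to bake the weights into the image (the Hugging Face repo ID and target path are placeholders, not from this thread):

```python
# download_model.py - run during `docker build` (e.g. RUN python download_model.py)
# so the weights are baked into the image instead of fetched at cold start.
from huggingface_hub import snapshot_download

# Placeholder model ID and path; substitute your own model.
snapshot_download(
    repo_id="mistralai/Mistral-7B-Instruct-v0.2",
    local_dir="/models/mistral-7b-instruct",
)
```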
TumbleWeed
TumbleWeed4mo ago
Is it okay to bake the model into the Docker image @ashleyk? Does it affect the cold start?
ashleyk
ashleyk4mo ago
No, loading it from network storage affects cold start more.
TumbleWeed
TumbleWeed4mo ago
How about the image pulling strategy? Does RunPod cache the image in an internal registry, or does it pull the image every time a worker is spawned?
ashleyk
ashleyk4mo ago
Your Docker image is cached onto the workers in advance, so it has no impact on cold start times.
TumbleWeed
TumbleWeed4mo ago
Alright, thank you, I will try it first @ashleyk. I have tried to set up the serverless endpoint. How do I check the logs of the pull? How do I know if the worker pulled the image successfully?
ashleyk
ashleyk4mo ago
Click on each worker and check. The workers will go "Idle" when they are done pulling the image.
TumbleWeed
TumbleWeed4mo ago
It's stuck on "Initializing". Does it return an error if the image pull fails? Let's say I have misconfigured the registry access.
ashleyk
ashleyk4mo ago
Click on the workers to check the logs.
TumbleWeed
TumbleWeed4mo ago
I see. Do I get charged when the worker is in the initializing state?
ashleyk
ashleyk4mo ago
No, only while the container is running - cold start + execution time.
TumbleWeed
TumbleWeed4mo ago
Wow, okay
justin
justin4mo ago
If you can share a screenshot of your template, that would also be good. Sometimes people forget the tag, so just double check that it looks like username/image:1.0; some people just write username/image.
Alpay Ariyak
Alpay Ariyak4mo ago
Just use our pre-made worker vLLM image and attach a network volume. On startup, the worker will download the model to the network storage, and all the workers will have access to it. The image itself is only 3GB as well, and there's no need to build it.
TumbleWeed
TumbleWeed4mo ago
I have successfully run my model, but it needs some adjustment, because inside the container I still run FastAPI for the endpoint.
ashleyk
ashleyk4mo ago
Yeah, you don't need FastAPI for serverless; serverless already provides an API layer for you.
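For reference, a minimal sketch of the handler that replaces a FastAPI route on RunPod serverless (the input key and echo logic are illustrative placeholders, not the actual model code from this thread):

```python
# handler.py - RunPod serverless entrypoint; requests arrive through the
# endpoint's /run and /runsync API routes, so no web framework is needed
# inside the container.
import runpod

def handler(job):
    # job["input"] holds whatever JSON the caller sent under "input".
    prompt = job["input"].get("prompt", "")
    # ... run your model here and return a JSON-serializable result ...
    return {"output": f"echo: {prompt}"}

runpod.serverless.start({"handler": handler})
```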
TumbleWeed
TumbleWeed4mo ago
Hi @ashleyk @Alpay Ariyak, I have tried deploying my LLM model on RunPod serverless. The image is over 40GB, and it isn't cost-effective to use Google Artifact Registry since they charge for egress outside of the GCP network. Any recommendations for a container registry? Thank you
ashleyk
ashleyk4mo ago
I just use Dockerhub, it's free.
TumbleWeed
TumbleWeed4mo ago
But they limit pulls, right?
ashleyk
ashleyk4mo ago
Yeah, they have rate limiting by IP, but you can use your token to authenticate instead. I use Dockerhub in my production serverless endpoints and have never had any issues.
justin
justin4mo ago
I was testing Llama/Mistral models that are close to the 35GB/40GB mark through Dockerhub with no issues. You can add your Docker credentials to RunPod too. Also, Dockerhub has one private repo per account, so if your image contains something sensitive, you'll obviously need to add your Docker credentials in the RunPod settings.
TumbleWeed
TumbleWeed4mo ago
Alright, I will try Dockerhub then. Thank you @ashleyk @justin
justin
justin4mo ago
Can't believe Google charges you egress for a registry 👁️ I know they do for GCP bucket data, but really for a container registry? Damn
ashleyk
ashleyk4mo ago
Google, AWS and Azure all charge massive egress costs for everything
TumbleWeed
TumbleWeed4mo ago
I have moved my LLM model to Dockerhub, so I don't get haunted by the GCP egress cost lol. I have another question: my cold start (loading the LLM model) is around 15s-30s, any way to optimize it? @ashleyk @justin
ashleyk
ashleyk4mo ago
1. Enable FlashBoot if you haven't already done so.
2. Load the model outside of runpod.serverless.start() so that it is cached in the worker and not loaded on every single request.
TumbleWeed
TumbleWeed4mo ago
So, it's possible to preload the model into the worker?
ashleyk
ashleyk4mo ago
You can also look at setting Active workers, but you are charged for those. That's basically what FlashBoot does, but it doesn't really provide any benefit unless you have a constant flow of requests.
TumbleWeed
TumbleWeed4mo ago
I will try the 2nd option and let you know the result.
justin
justin4mo ago
No. But what he is saying is to do model = load(model) at module scope and call model.predict() inside the handler, as in the sketch below. The model load will get added to your delay time, but on subsequent requests, if the worker is still active and didn't spin down, it doesn't need to reload the model into memory to take other requests. If you had it in function scope, the variable would be reset on every request.
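A runnable version of that pattern, with the model loaded at module scope so a warm worker reuses it across requests (the transformers pipeline and model path are placeholder assumptions, not the actual model from this thread):

```python
import runpod
from transformers import pipeline  # example loader; substitute your own

# Loaded once at import time, outside the handler, so a warm worker reuses
# the weights across requests instead of reloading them on every call.
model = pipeline("text-generation", model="/models/mistral-7b-instruct")  # placeholder path

def handler(job):
    prompt = job["input"]["prompt"]
    result = model(prompt, max_new_tokens=128)
    return {"output": result[0]["generated_text"]}

runpod.serverless.start({"handler": handler})
```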
Alpay Ariyak
Alpay Ariyak4mo ago
Hi @WillyRL, is there a reason you don't want to use https://github.com/runpod-workers/worker-vllm ? It solves all of your problems already
GitHub
GitHub - runpod-workers/worker-vllm: The RunPod worker template for...
The RunPod worker template for serving our large language model endpoints. Powered by vLLM. - runpod-workers/worker-vllm
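For completeness, a sketch of how a client could call a deployed serverless endpoint once the workers are live (the endpoint ID is a placeholder, and the exact input schema for worker-vllm should be checked against its README):

```python
import os
import requests

ENDPOINT_ID = "your-endpoint-id"  # placeholder
API_KEY = os.environ["RUNPOD_API_KEY"]

# /runsync blocks until the job finishes; /run returns a job ID to poll instead.
resp = requests.post(
    f"https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"input": {"prompt": "Hello, world"}},  # assumed simple prompt field
    timeout=120,
)
print(resp.json())
```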