Run LLM Model on Runpod Serverless

Hi there, I have an LLM model built into a Docker image, and the image is 40GB+. I'm wondering, can I mount the model as a volume instead of adding it to the Docker image? Thanks!
31 Replies
ashleyk
ashleyk4mo ago
Yes, you can put your model on network storage and load it from there, but it's generally more performant to bake the model into the Docker image because network storage is incredibly slow. Network storage also limits GPU availability.
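For illustration, a minimal sketch of a build-time download script that could be invoked from a Dockerfile RUN step to bake the weights into the image (the Hugging Face repo ID and target path are placeholders, not from this thread):

```python
# download_model.py - run during `docker build` (e.g. RUN python download_model.py)
# so the weights are baked into the image instead of fetched at cold start.
from huggingface_hub import snapshot_download

# Placeholder model ID and path; substitute your own model.
snapshot_download(
    repo_id="mistralai/Mistral-7B-Instruct-v0.2",
    local_dir="/models/mistral-7b-instruct",
)
```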
TumbleWeed
TumbleWeed4mo ago
Is it okay to bake the model into the Docker image @ashleyk? Does it affect the cold start?
ashleyk
ashleyk4mo ago
No, loading it from network storage affects cold start more.
TumbleWeed
TumbleWeed4mo ago
How about the image pulling strategy? Does RunPod cache the image in an internal registry, or does it pull the image every time a worker is spawned?
ashleyk
ashleyk4mo ago
Your Docker image is cached onto the workers in advance, so it has no impact on cold start times.
TumbleWeed
TumbleWeed4mo ago
Alright, thank you, I will try it first @ashleyk. I have tried to set up the serverless endpoint. How do I check the logs of the pull? How do I know if the worker pulled the image successfully?
ashleyk
ashleyk4mo ago
Click on each worker and check. The workers will go "Idle" when they are done pulling the image.
TumbleWeed
TumbleWeed4mo ago
It's stuck on "Initializing". Does it return an error if the image pull fails? Let's say I have misconfigured the registry access.
ashleyk
ashleyk4mo ago
Click on the workers to check the logs.
TumbleWeed
TumbleWeed4mo ago
I see. Do I get charged when the worker is in the initializing state?
ashleyk
ashleyk4mo ago
No, only while the container is running - cold start + execution time.
TumbleWeed
TumbleWeed4mo ago
Wow, okay
justin
justin4mo ago
If you can share a screenshot of your template, that would also be good. Sometimes people forget the tag, so just double check that it looks like username/image:1.0; some people just write username/image.
Alpay Ariyak
Alpay Ariyak4mo ago
Just use our pre-made worker vLLM image and attach a network volume. On startup, the worker will download the model to the network storage, and all the workers will have access to it. The image itself is only 3GB as well, and there's no need to build it.
TumbleWeed
TumbleWeed4mo ago
I have successfully run my model, but it needs some adjustment, because inside the container I still run FastAPI for the endpoint.
ashleyk
ashleyk4mo ago
Yeah, you don't need FastAPI for serverless; serverless already provides an API layer for you.
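For reference, a minimal sketch of the handler that replaces a FastAPI route on RunPod serverless (the input key and echo logic are illustrative placeholders, not the actual model code from this thread):

```python
# handler.py - RunPod serverless entrypoint; requests arrive through the
# endpoint's /run and /runsync API routes, so no web framework is needed
# inside the container.
import runpod

def handler(job):
    # job["input"] holds whatever JSON the caller sent under "input".
    prompt = job["input"].get("prompt", "")
    # ... run your model here and return a JSON-serializable result ...
    return {"output": f"echo: {prompt}"}

runpod.serverless.start({"handler": handler})
```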
TumbleWeed
TumbleWeed4mo ago
Hi @ashleyk @Alpay Ariyak, I have tried deploying my LLM model on RunPod serverless. The image is over 40GB, and it isn't cost-effective to use Google Artifact Registry since they charge for egress outside of the GCP network. Any recommendations for a container registry? Thank you
ashleyk
ashleyk4mo ago
I just use Dockerhub, it's free.
TumbleWeed
TumbleWeed4mo ago
But they limit pulls, right?
ashleyk
ashleyk4mo ago
Yeah, they have rate limiting by IP, but you can use your token to authenticate instead. I use Dockerhub in my production serverless endpoints and have never had any issues.
justin
justin4mo ago
I was testing Llama/Mistral models that are close to the 35GB/40GB mark through Dockerhub with no issues. You can add your Docker credentials to RunPod too. Also, Dockerhub has one private repo per account, so if your image contains something sensitive, you'll obviously need to add your Docker credentials in the RunPod settings.
TumbleWeed
TumbleWeed4mo ago
Alright, I will try Dockerhub then. Thank you @ashleyk @justin
justin
justin4mo ago
Can't believe Google charges you egress for a registry 👁️ I know they do for GCP bucket data, but really for a container registry? Damn
ashleyk
ashleyk4mo ago
Google, AWS and Azure all charge massive egress costs for everything
TumbleWeed
TumbleWeed4mo ago
I have moved my LLM model to Dockerhub, so I don't get haunted by the GCP egress cost lol. I have another question: my cold start (loading the LLM model) is around 15s-30s, any way to optimize it? @ashleyk @justin
ashleyk
ashleyk4mo ago
1. Enable FlashBoot if you haven't already done so.
2. Load the model outside of runpod.serverless.start() so that it is cached in the worker and not loaded on every single request.
TumbleWeed
TumbleWeed4mo ago
So, it's possible to preload the model into the worker?
ashleyk
ashleyk4mo ago
You can also look at setting Active workers, but you are charged for those. That's basically what FlashBoot does, but it doesn't really provide any benefit unless you have a constant flow of requests.
TumbleWeed
TumbleWeed4mo ago
I will try the 2nd option and let you know the result.
justin
justin4mo ago
No. But what he is saying is to do model = load(model) at module scope and call model.predict() inside the handler, as in the sketch below. The model load will get added to your delay time, but on subsequent requests, if the worker is still active and didn't spin down, it doesn't need to reload the model into memory to take other requests. If you had it in function scope, the variable would be reset on every request.
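A runnable version of that pattern, with the model loaded at module scope so a warm worker reuses it across requests (the transformers pipeline and model path are placeholder assumptions, not the actual model from this thread):

```python
import runpod
from transformers import pipeline  # example loader; substitute your own

# Loaded once at import time, outside the handler, so a warm worker reuses
# the weights across requests instead of reloading them on every call.
model = pipeline("text-generation", model="/models/mistral-7b-instruct")  # placeholder path

def handler(job):
    prompt = job["input"]["prompt"]
    result = model(prompt, max_new_tokens=128)
    return {"output": result[0]["generated_text"]}

runpod.serverless.start({"handler": handler})
```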
Alpay Ariyak
Alpay Ariyak4mo ago
Hi @WillyRL, is there a reason you don't want to use https://github.com/runpod-workers/worker-vllm ? It solves all of your problems already
GitHub
GitHub - runpod-workers/worker-vllm: The RunPod worker template for...
The RunPod worker template for serving our large language model endpoints. Powered by vLLM. - runpod-workers/worker-vllm
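For completeness, a sketch of how a client could call a deployed serverless endpoint once the workers are live (the endpoint ID is a placeholder, and the exact input schema for worker-vllm should be checked against its README):

```python
import os
import requests

ENDPOINT_ID = "your-endpoint-id"  # placeholder
API_KEY = os.environ["RUNPOD_API_KEY"]

# /runsync blocks until the job finishes; /run returns a job ID to poll instead.
resp = requests.post(
    f"https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"input": {"prompt": "Hello, world"}},  # assumed simple prompt field
    timeout=120,
)
print(resp.json())
```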