Runpod · 5mo ago
Foopop

serverless does not cache at all

So because the serverless vLLM worker didn't have something I needed, I changed it a bit and uploaded my own Docker image of it. But now after each request it has to load the model completely again, and that takes 90 seconds every time. I do a request, the worker spends 90s loading, handles the request, then goes offline again after the 5s idle timeout I set, and when I send another request it has to do the 90s loading all over again. It does all of that on every request. I use a vLLM endpoint on serverless with a different model and it does not load this long, in fact it's below 1 second even after the workers went offline after the timeout. Why is that? I'm already using network storage for the model. Here are the logs of what has to happen on each request, which takes 90 seconds.
10 Replies
FoopopOP · 5mo ago
I think I have to do this, right? "Embed models in Docker images: For production environments, consider packaging your ML models directly within your worker container image instead of downloading them in your handler function. This places models on high-speed local storage." I tried that before, but the model is 30GB, and when I tried to push that Docker layer it failed with an HTTP exception because the layer is too big or something. And my wifi is slow, so uploading it took an eternity.
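(For reference, a minimal sketch of what baking the model into the worker image could look like. The base image tag, model ID, and cache path below are illustrative placeholders, not the actual setup from this thread.)

```dockerfile
# Sketch only: bake the model weights into the image so the worker reads them
# from fast local disk on cold start instead of loading them over the network.
# The base image tag below is an illustrative placeholder.
FROM runpod/worker-v1-vllm:stable-cuda12.1.0

# Hypothetical model and cache location -- adjust to your setup.
ENV MODEL_NAME="OpenGVLab/InternVL2-8B"
ENV HF_HOME="/models"

# Download the weights at build time. This layer will be ~30GB, so pushing it
# still needs a registry and connection that can handle very large layers.
RUN pip install --no-cache-dir huggingface_hub && \
    python -c "from huggingface_hub import snapshot_download; snapshot_download('${MODEL_NAME}')"
```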
3WaD · 5mo ago
Do you have Flashboot enabled on the endpoint?
FoopopOP · 5mo ago
Yeah, I do. Do I understand it correctly that after some idle time the worker has to load the whole safetensors again, BUT if you call the worker again after a few minutes, even if the timeout is 5 seconds, the safetensors are still cached or something?
3WaD · 5mo ago
Yes. FlashBoot keeps the idle workers warm. More or less, if there's not too much traffic.
FoopopOP · 5mo ago
Yeah, right. But why doesn't it work with my Docker image? Do you think it's because the model is not baked into the Docker image? I'm trying a Docker image with the model built in right now, but it takes time to test this.
3WaD · 5mo ago
Aren't you forcing the engine to reinitialize on each run with the code you customized? The handler should check if the engine already exists.
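(The pattern being described looks roughly like this. A minimal sketch assuming the standard `runpod` Python SDK handler, with the engine construction and inference replaced by stand-ins rather than the actual worker-vllm code.)

```python
import os
import runpod

# Module-level cache: on a warm (FlashBoot) worker this object survives between
# requests, so the heavy model load only happens on a true cold start.
_engine = None

def get_engine():
    """Build the engine once and reuse it for the lifetime of the worker."""
    global _engine
    if _engine is None:
        # Stand-in for the real (expensive) vLLM engine construction.
        _engine = {"model": os.getenv("MODEL_NAME", "placeholder-model")}
    return _engine

def handler(job):
    engine = get_engine()  # cheap after the first request on a warm worker
    # Stand-in for real inference with the cached engine.
    return {"model": engine["model"], "echo": job["input"]}

runpod.serverless.start({"handler": handler})
```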
FoopopOP · 5mo ago
I cloned this, and my only customization is that I add one arg to the engine, a one-line change, that's it. That one line is the reason I have to do all this: https://github.com/runpod-workers/worker-vllm
If I understand it correctly, this is the same thing that RunPod serverless vLLM uses when I launch an instance via the console on RunPod, so it should work the same, right?
In engine_args.py I added this line, that's it:
# Add the new multimodal limit parameter
"limit_mm_per_prompt": convert_limit_mm_per_prompt(os.getenv('LIMIT_MM_PER_PROMPT', 'image=1,video=0')),
Without that I can't limit video to 0 and can't use InternVL.
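(For context, a `convert_limit_mm_per_prompt` helper along these lines would turn that env string into the per-modality dict that vLLM's `limit_mm_per_prompt` expects. This is a sketch of the idea, not the exact code from the fork.)

```python
def convert_limit_mm_per_prompt(value: str) -> dict:
    """Parse a string like 'image=1,video=0' into {'image': 1, 'video': 0}.

    Sketch of the helper referenced above; the real implementation may differ.
    """
    limits = {}
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue
        modality, _, count = item.partition("=")
        limits[modality.strip()] = int(count)
    return limits


# Example: convert_limit_mm_per_prompt("image=1,video=0") -> {"image": 1, "video": 0}
```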
3WaD · 5mo ago
Then it should work. Try to observe what's happening with the worker once it does the cold start, and whether the next request after it goes to the same worker.
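(One way to check this from the client side is to send two requests back to back against the endpoint and compare the reported delay times. This is a sketch; it assumes the /runsync response includes delayTime, executionTime, and workerId fields, which should be verified against the actual response.)

```python
import os
import time
import requests

# Endpoint ID and API key are assumed to come from environment variables.
ENDPOINT_ID = os.environ["RUNPOD_ENDPOINT_ID"]
API_KEY = os.environ["RUNPOD_API_KEY"]
URL = f"https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync"

def send(prompt: str) -> dict:
    resp = requests.post(
        URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"input": {"prompt": prompt}},
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()

first = send("hello")
time.sleep(30)  # shorter than however long FlashBoot keeps the worker warm
second = send("hello again")

# If the second request lands on the same warm worker, its delay should be far
# below the ~90s cold start. Field names here are assumptions; check your response.
for label, r in (("first", first), ("second", second)):
    print(label, r.get("workerId"), r.get("delayTime"), r.get("executionTime"))
```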
FoopopOP · 5mo ago
The workers just load the model, i.e. the safetensors, on each request when they are in idle mode. It says "removing container" or something at the end each time. I think it's because the model is not baked in. Once the image is built and uploaded I'll try that.
3WaD · 5mo ago
I think that's the message when the worker is removed from the endpoint? I definitely don't see it after a request when the worker stays warm.
