vLLM model loading, TTFT unhappy path

I am looking for a way to reduce latency on the unhappy path of my vLLM endpoints. I use the quickstart vLLM template, backed by network storage for the model weights, with FlashBoot enabled. By default, the worker loads the model weights on the first request. This, however, risks exposing my customers to an unhappy path with latency measured in minutes; at scale we would see this in significant absolute numbers. What is the best way to ensure that a worker is considered ready only >after< it has loaded the model checkpoint, and to trigger checkpoint loading without sending a first request? Should I roll my own vLLM container image? Or is there an idiomatic way to parametrize the quickstart template to achieve this? I would prefer to stay on the Runpod-supplied, properly supported vLLM image, if possible.
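To make the desired behaviour concrete, here is a minimal sketch of the eager-loading pattern I have in mind, in case rolling my own handler turns out to be the answer. This is an assumption on my part, not the quickstart template's actual code: `load_engine` is a hypothetical stand-in for the real vLLM engine construction, and the volume path is made up.

```python
# Sketch: construct the engine at module import time, so the serverless
# loop (and therefore "worker ready") only starts after the weights are
# in memory. The first customer request then never pays the load cost.

def load_engine(model_path):
    # Hypothetical placeholder for the expensive part: in a real worker
    # this would build the vLLM engine and pull weights from the
    # attached network storage, taking minutes on a cold start.
    return {"model": model_path, "ready": True}

# Module-level load: runs before the handler is registered.
ENGINE = load_engine("/runpod-volume/my-model")  # path is illustrative

def handler(event):
    # By the time any request arrives, the engine is guaranteed loaded.
    assert ENGINE["ready"]
    return {"output": f"served by {ENGINE['model']}"}

# In a real custom worker this would be followed by:
# import runpod
# runpod.serverless.start({"handler": handler})
```

The point is simply that the load happens before the handler loop starts, rather than lazily inside the first request; I am asking whether the supported image can be configured to do the same.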
