Speed up text-generation-inference startup time
I'm running Hugging Face's text-generation-inference (TGI) to serve LLMs on pods. The startup time can be significant because the first thing the image does is download the entire model I'm running.
For example, it often takes 10-15 minutes from the time I start a Pod until it is available. When we are autoscaling due to load, that is a long time!
Is there any way to speed up how quickly pods start up in this scenario?
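For reference, the bulk of that startup time is the weight download itself. A rough sketch of what TGI effectively does on first boot when its cache is empty; the model id and cache path below are placeholders for illustration, not details from this thread:

```python
# Rough sketch: on first boot TGI pulls the full model weights from the
# Hugging Face Hub into its cache directory, which is what dominates the
# 10-15 minute startup described above.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="meta-llama/Meta-Llama-3-70B-Instruct",          # placeholder model id
    cache_dir="/data",                                        # placeholder cache/volume path
    allow_patterns=["*.safetensors", "*.json", "*.model"],    # weights plus config/tokenizer files
)
```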
If you're scaling up and down using pods, I would encourage using serverless. Container image caching is persistent on serverless; with pods we eventually evict the container images if they're stopped for long.
@flash-singh yes, we need to explore using Serverless. Do you know if people are successfully running TGI on serverless? We're currently using pods with 2x H100s... I assume serverless can handle running larger models. And we could consider switching to vLLM, which I see in a lot of the serverless docs - we just need to vet any impact on performance at scale.
Yes, many are running vLLM. You can also use load balancer serverless for vLLM if you want lower latency for requests.
Whether serverless can handle large models has more to do with what type of GPU you use than with serverless itself.
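A minimal sketch of what a vLLM worker on RunPod serverless can look like, using the runpod Python SDK's handler pattern. The model id, tensor-parallel size, and input fields are assumptions for illustration only:

```python
# Minimal sketch of a RunPod serverless worker wrapping vLLM.
import runpod
from vllm import LLM, SamplingParams

# Load the model once at import time, so warm workers answer requests
# without repeating the (slow) model load.
llm = LLM(
    model="meta-llama/Meta-Llama-3-70B-Instruct",  # placeholder model id
    tensor_parallel_size=2,                        # e.g. split across 2 GPUs
)

def handler(job):
    """Handle one request of the form {"input": {"prompt": "...", ...}}."""
    job_input = job["input"]
    params = SamplingParams(
        max_tokens=job_input.get("max_tokens", 256),
        temperature=job_input.get("temperature", 0.7),
    )
    outputs = llm.generate([job_input["prompt"]], params)
    return {"text": outputs[0].outputs[0].text}

runpod.serverless.start({"handler": handler})
```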
OK - we will test that out next week. I guess your response means there really isn't a way to speed up starting pods, though, correct?
No, there is not. Over time we remove container images when a pod is stopped; this helps avoid disk issues, since there is only so much that can be stored on a single server. The way we avoid this with serverless is through automation: you're not assigned a single server, you pick regions, and we can move the workload on the fly to servers with lower disk usage and maintain a stable production deployment.
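Once traffic goes through a serverless endpoint instead of a specific pod, clients call the endpoint id rather than any particular server. A rough sketch of a synchronous request against RunPod's /runsync route; the endpoint id and payload shape are placeholders for whatever the worker above expects:

```python
# Rough sketch of calling a RunPod serverless endpoint synchronously.
import os
import requests

ENDPOINT_ID = "your-endpoint-id"            # placeholder endpoint id
API_KEY = os.environ["RUNPOD_API_KEY"]      # assumes the API key is in the environment

resp = requests.post(
    f"https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"input": {"prompt": "Hello!", "max_tokens": 64}},
    timeout=120,
)
resp.raise_for_status()
print(resp.json())
```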