Runpod · 15mo ago
Emad

Llama 3.1 8B Model Cold Start and Delay Time Very Long

Hey, our cold start time always exceeds a minute, and the delay is just as long. For live serving we need this to be quicker. We have tried using a network volume as well, but it doesn't change anything.
16 Replies
Unknown User · 15mo ago (message not public)
Emad (OP) · 15mo ago
Is there no other solution? We need to control costs as well.
Unknown User · 15mo ago (message not public)
Emad (OP) · 15mo ago
It's not used every minute. At night our user count is lower, so it's used less frequently. The reason RunPod was pushed by our team was that we saw it advertised record cold start times.
Unknown User · 15mo ago (message not public)
Emad (OP) · 15mo ago
But I thought that, according to the blog posts, cold start times for LLMs were measured in seconds.
Unknown User · 15mo ago (message not public)
Emad (OP) · 15mo ago
I tried it both through a network volume and normally; both give the same result.
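For context, loading from a network volume in a RunPod serverless worker typically looks like the sketch below. The `/runpod-volume` mount point is standard for serverless network volumes, but the model directory and the use of `transformers` are assumptions about this setup, not details from the thread:

```python
import os

from transformers import AutoModelForCausalLM, AutoTokenizer

# Network volumes mount at /runpod-volume inside RunPod serverless workers.
# The subdirectory is hypothetical; point it at wherever the weights live.
MODEL_DIR = os.environ.get("MODEL_DIR", "/runpod-volume/llama-3.1-8b")

# Reading roughly 16 GB of fp16 weights is disk/network bound either way,
# which is one plausible reason the network volume made no visible
# difference: the load itself, not the image download, dominates.
tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_DIR,
    torch_dtype="auto",
    device_map="auto",
)
```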
Emad (OP) · 15mo ago
RunPod Blog
Run Larger LLMs on RunPod Serverless Than Ever Before - Llama-3 70B...
Up until now, RunPod has only supported using a single GPU in Serverless, with the exception of using two 48GB cards (which honestly didn't help, given the overhead involved in multi-GPU setups for LLMs). You were effectively limited to what you could fit in 80GB, so you would essentially be…
Unknown User · 15mo ago (message not public)
Emad (OP) · 15mo ago
Yes, the next request is faster, but a request after a while takes over a minute again.
Unknown User · 15mo ago (message not public)
Emad (OP) · 15mo ago
I thought FlashBoot applied to the first request as well.
Unknown User · 15mo ago (message not public)
yhlong00000 · 15mo ago
When you create an endpoint, the worker first needs to download the image. Depending on the size of the model you're running, this can take some time. If you send a request during this initial phase, it will remain in the queue and won't be processed, because the worker isn't ready to serve yet.

Once the worker is initialized, performance depends on your request traffic pattern, your idle timeout setting, and the minimum number of workers you've configured. If your requests are sporadic and there are no active workers, you will experience a cold start delay. However, if you have a steady stream of requests, you'll benefit from faster response times.
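To make the traffic-pattern point concrete, below is a minimal sketch of a RunPod serverless handler using the `runpod` Python SDK, assuming a `transformers`-based Llama load (the model ID is a placeholder). The expensive model load sits at module import, so only a cold worker pays for it; a warm worker reuses the loaded model and responds in seconds:

```python
import runpod
from transformers import pipeline

# This load runs once per worker cold start. Warm workers (still alive
# within the idle timeout, or pinned via active/min workers) skip it.
MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder model path
generate = pipeline(
    "text-generation",
    model=MODEL_ID,
    torch_dtype="auto",
    device_map="auto",
)

def handler(job):
    # Only per-request work belongs here; no model loading.
    prompt = job["input"]["prompt"]
    out = generate(prompt, max_new_tokens=256)
    return {"output": out[0]["generated_text"]}

runpod.serverless.start({"handler": handler})
```

With sporadic traffic, setting one active worker eliminates cold starts at the cost of idle billing; raising the idle timeout is the cheaper middle ground.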
Unknown User · 15mo ago (message not public)
