Guidance on Mitigating Cold Start Delays in Serverless Inference
We are experiencing delays during cold starts of our serverless endpoint used for inference with a machine learning model (Whisper). The main suspected cause is the download of the model weights (a custom model trained by us), which are fetched via the Hugging Face package within the Python code. We are exploring possible solutions and need guidance on feasibility and best practices.
Additional Context:
   - The inference server currently fetches model weights dynamically from Hugging Face during initialization, leading to delays (a rough sketch of this pattern follows this list).
   - The serverless platform is being used for inference as part of a production system requiring low latency.
   - We offer streaming inference, where low latency is critical for usability. Currently, many calls experience delays exceeding 5 seconds, which makes the solution unfeasible for our purposes (see the attached image showing this month's delay times).
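For context, a simplified sketch of what a worker does on cold start today (the repo id is a placeholder for our private model, and the loading details are abbreviated):

```python
# Simplified sketch of the current cold-start path: the weights are fetched from
# the Hugging Face Hub inside the worker's init code, so every fresh worker pays
# for the download before it can serve its first request.
from huggingface_hub import snapshot_download

def init_model():
    # Placeholder repo id for our custom Whisper model; cached locally after the
    # first call, but a brand-new worker starts with an empty cache.
    model_dir = snapshot_download(repo_id="our-org/whisper-custom")
    return model_dir  # downstream code loads the weights from this directory
```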
Solution Options
1. Injecting the Model Directly into the Docker Container
This involves embedding the fully trained model within the Docker image. The server would bring up the container with the model files already included (a build-time sketch follows the questions below).
Cons: This will result in significantly larger Docker images.
Questions:
    - Will this approach impact cold-start times, considering the increased image size?
    - Are there recommended limits on Docker image sizes for serverless environments?
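A sketch of how the build-time download could look; the script name, repo id, and target path are placeholders, and the script would be invoked from the Dockerfile (e.g. `RUN python download_model.py`) so the weights end up in an image layer:

```python
# download_model.py - run once during `docker build` so the weights ship inside
# the image and are never fetched at cold start. Repo id and target path are
# placeholders.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="our-org/whisper-custom",        # placeholder custom model repo
    local_dir="/app/models/whisper-custom",  # baked into the image layer
)
```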
2. Using Storage - Disk Volume
Store the model weights on a disk volume provided by the cloud provider. The serverless instance would mount the storage to access the weights (a loading sketch follows the questions below).
Cons: Potential additional storage costs.
Questions:
    - Does the serverless platform support disk volume storage? We could only find documentation about using storage with Pods.
    - If supported, is mounting disk volume storage expected to improve cold-start performance?
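If volume storage is supported for serverless workers, the handler would point the model loader at the mount instead of downloading anything; a rough sketch using faster-whisper, which we use for loading (the mount path and directory layout are assumptions):

```python
# Sketch of loading the weights from a mounted volume instead of downloading
# them during init. The mount path and directory layout are assumptions.
import os
from faster_whisper import WhisperModel

MODEL_DIR = "/volume/models/whisper-custom"  # wherever the provider mounts the volume

if not os.path.isdir(MODEL_DIR):
    raise RuntimeError(f"model weights not found on volume: {MODEL_DIR}")

# Loading from a local path skips any network fetch entirely.
model = WhisperModel(MODEL_DIR, device="cuda", compute_type="float16")
```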
3. Using Storage - Network Storage
Host the model weights on an intranet storage solution to enable faster downloads (compared to public repositories like Hugging Face); a sketch of populating such storage follows the questions below.
Cons: Possible network storage costs and additional management overhead.
Questions:
    - Does the serverless platform support network storage for serverless instances? Again, documentation appears focused on Pods.
    - Are there recommendations or best practices for integrating network storage with serverless instances?
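If network storage is an option, a one-off job could populate it from Hugging Face so workers only ever read from the nearby copy; a minimal sketch (paths and repo id are placeholders):

```python
# One-off population job: pull the weights from Hugging Face once and copy them
# to the network storage that serverless workers would mount. Paths and the
# repo id are placeholders.
import shutil
from huggingface_hub import snapshot_download

source_dir = snapshot_download(repo_id="our-org/whisper-custom")
target_dir = "/mnt/network-storage/models/whisper-custom"

shutil.copytree(source_dir, target_dir, dirs_exist_ok=True)
print(f"weights copied to {target_dir}")
```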
We would like some guidance on which approach we should pursue, considering we intend to use it for streaming inference.
If any of those options are not optimal, could you suggest an alternative?

20 Replies
Unknown User•10mo ago
@nerdylive 
Actually, we download the models only during the build, so they are not being downloaded again during cold starts. However, we still think the "normal" cold starts are too long, taking about 10s (loading the model itself usually takes about 2-5s).
Furthermore, we have no idea why, in some rare cases, it takes an absurd amount of time, like the >100s delays. This is our biggest problem.
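For reference, a simplified sketch of our handler pattern: the model is loaded once per worker at import time (the 2-5s load mentioned above); the paths, compute settings, and input schema are placeholders.

```python
# Simplified handler sketch: the model is loaded once when the worker process
# starts, then reused for every request it serves. Paths, compute settings, and
# the input schema are placeholders.
import time

import runpod
from faster_whisper import WhisperModel

t0 = time.time()
model = WhisperModel("/app/models/whisper-custom", device="cuda", compute_type="float16")
print(f"model loaded in {time.time() - t0:.1f}s")  # typically the 2-5s mentioned above

def handler(job):
    audio_path = job["input"]["audio"]  # placeholder input schema
    segments, _info = model.transcribe(audio_path)
    return {"text": " ".join(segment.text for segment in segments)}

runpod.serverless.start({"handler": handler})
```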
Unknown User•10mo ago
Yeah. Sometimes it did on specific workers. I used faster-whisper to load them. And there's nothing failing.
But this still does not explain how I got more than 100s of delay.
Unknown User•10mo ago
@nerdylive I noticed that sometimes a worker takes too much time to completely set up a Docker image, and sometimes the worker that is "downloading" the Docker image is shown as "idle" instead of "initializing". I think this is a bug. What can happen in this case is that a request may be allocated to this bugged worker, and I believe this is why the delay can sometimes be huge.
Would using a Network Volume solve this problem? Note: I already download the models when building the Docker image, so they're already cached. The problem is when a new worker is started and it needs to download the Docker image. My image is 8 GiB in total, so it's not that big, but downloading it still takes too much time through RunPod.
Or is the Network Volume completely unrelated in this case?
Unknown User•9mo ago
I passed both pieces of info to support yesterday:
request id: 
sync-2fbf700d-b754-44d2-8df2-9ac9fb536005-u1
worker id: l8q3x9g7a1prqj
While I'm not 100% sure that this happened (since I did not write down the exact worker id), I noticed in the log that the worker with a "running" status was downloading the Docker image.
But after the worker executed the request, the previous log disappeared.
Unknown User•9mo ago
Thanks thanks
Unknown User•9mo ago
Oh yeah, did it already.
Unknown User•9mo ago
@nerdylive the problem I mentioned is actually happening right now

I don't believe this worker should be considered idle
Unknown User•9mo ago
I'm using the same endpoint, just terminated the other workers as a test.
I did refresh the page but with F5.
Unknown User•9mo ago
I mean, now it's okay since it's been some time since everything downloaded.
I had enabled the network volume before, thinking it could be a solution. Then I disabled it and terminated all the workers to get new ones on the "latest version" of the endpoint. Some workers already had the Docker image cached (probably because I had used them before), but the ones that didn't had to download it.
And I see this all the time: different workers downloading the image, even in the same endpoint. I thought it was standard.
Unknown User•9mo ago