Questions on preventing model reloads in Serverless inference
Hello,
I am experimenting with implementing a serverless image generation API by serving my own model through Docker.
During testing, I observed the following behavior:
    •    On the first request, the model checkpoint has to be loaded and pinned into VRAM, which adds about one minute of delay.
    •    After loading, inference is immediate and smooth.
    •    With requests spaced about one hour apart, the VRAM pinning seems to persist and no extra delay occurs.
    •    However, if there is no request for around 24 hours, the model appears to go through the pinning process again.
My questions are:
    1.    Is my understanding correct that as long as requests are made periodically, the model remains pinned in VRAM?
    2.    How frequently should requests be made to ensure that pinning is not released?
    3.    Are there any recommended approaches or best practices to prevent pinning from being released (e.g., configuration options, serverless policies, pre-warm intervals)?
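To make question 3 concrete, the kind of pre-warm approach I have in mind is a small scheduled pinger, roughly like the sketch below. Everything specific in it is an assumption for illustration: the endpoint URL is a placeholder, the 30-minute interval is only a guess based on the one-hour gap that stayed warm in my tests, and the 10-second cold-start threshold is arbitrary. It does not rely on any provider-specific setting.

```python
import time
import urllib.request

# Hypothetical values: placeholders, not taken from any provider documentation.
ENDPOINT_URL = "https://example.invalid/generate"
PING_INTERVAL_S = 30 * 60  # guess: shorter than the ~1 hour gap that stayed warm


def send_ping(url=ENDPOINT_URL, timeout_s=120.0):
    """Send one lightweight request and return its latency in seconds."""
    start = time.monotonic()
    with urllib.request.urlopen(url, timeout=timeout_s) as resp:
        resp.read()
    return time.monotonic() - start


def keep_warm(ping, interval_s, max_pings, sleep=time.sleep, cold_threshold_s=10.0):
    """Call ping() max_pings times, sleeping interval_s between calls.

    Returns a list of (ping_index, latency) pairs for pings slower than
    cold_threshold_s, i.e. pings that probably hit a reload.
    """
    cold_starts = []
    for i in range(max_pings):
        latency = ping()
        if latency > cold_threshold_s:
            cold_starts.append((i, latency))
        if i < max_pings - 1:
            sleep(interval_s)  # no sleep after the final ping
    return cold_starts
```

Running `keep_warm(send_ping, PING_INTERVAL_S, max_pings=48)` would cover about a day and report which pings, if any, still hit a reload, which should help narrow down the real idle timeout.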
My goal is to achieve immediate API responses while minimizing the cold-start delay caused by model loading. Any guidance or best practices you can share would be greatly appreciated.
Thank you.