Hi Runpod! We've been using serverless pods for quite a while now. Most of our customer-facing work ran in the background, on demand, which meant we could tolerate the long warm-up times.
However, to meet our customers' demands, we have made several key improvements to our generation times.
That being said, our main bottleneck today is the infrastructure itself.
We use quite a few models to do the work for our customers, and have tried 3 different paths:
1. Baking the models into images in a private registry - intolerable. The images kept re-downloading layers that had not changed, which made this unsustainable during development unless we separated the models into their own layers. And even then, whenever we needed to add a new LoRA etc., it caused a lot of issues.
2. Downloading the models at boot time - too slow and unreliable.
3. Using a network volume - this is our current setup. We use a network volume to persist all our models, but it comes with a caveat: loading models into the GPU from a network volume is at least 10x slower than from a container volume - we are talking long minutes instead of short seconds.
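For context, the workaround we've been experimenting with on top of option 3 is staging weights from the network volume onto container-local disk once per cold start, so the actual GPU load reads from fast local storage. A rough sketch of that staging step is below - the mount paths are placeholders for illustration, not our real config:

```python
import shutil
import time
from pathlib import Path

# Placeholder paths - adjust to the actual network-volume mount point
# and a container-local scratch directory in your worker.
NETWORK_VOLUME = Path("/runpod-volume/models")
LOCAL_CACHE = Path("/models")


def stage_models(src: Path, dst: Path) -> float:
    """Copy model weights from the network volume to local disk once per
    cold start, so GPU loading reads from local storage instead of the
    network mount. Returns elapsed seconds."""
    start = time.monotonic()
    dst.mkdir(parents=True, exist_ok=True)
    for f in src.rglob("*"):
        if f.is_file():
            target = dst / f.relative_to(src)
            target.parent.mkdir(parents=True, exist_ok=True)
            if not target.exists():  # skip files already staged on a warm start
                shutil.copy2(f, target)
    return time.monotonic() - start


# Example usage (paths are placeholders):
# elapsed = stage_models(NETWORK_VOLUME, LOCAL_CACHE)
# print(f"staged models in {elapsed:.1f}s")
```

This helps on warm starts, but on a cold start it just moves the slow network read earlier rather than eliminating it, which is why we're asking about best practice.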
We need a better solution for this, and want to understand the current best practice for reducing these model load times.