How do people serve large models on Runpod Serverless?
Hi all, I'm looking for real-world advice on shipping a model with large weights (60 GB+) as a Runpod Serverless endpoint. I seem to be stuck between two awkward choices:
1) Embed the weights in the Docker image
Pros: Once the image lands on a worker, the cold start only covers loading the weights into GPU memory.
Cons: A ~70 GB image is painful to build: most CI runners don't have that much local disk, building an image that big usually takes hours, and Runpod support says very large images roll out to workers more slowly.
2) Keep the weights on a Runpod Network Volume
Pros: Tiny image, so CI/CD is easy.
Cons: Every cold start streams 60 GB from the volume; I'm seeing ~2-minute cold starts, which cost more than the actual inference.
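(For concreteness, option 2 boils down to something like the sketch below. The model path and generation settings are placeholders, and I'm assuming the usual /runpod-volume mount point for network volumes on serverless workers. Option 1 would be the same code with MODEL_DIR pointing at a path baked into the image.)

```python
# Minimal sketch of option 2: a Runpod Serverless handler that loads weights from
# the network volume at startup. Paths, model, and generation settings are placeholders.
import time

import runpod
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_DIR = "/runpod-volume/models/my-70b-model"  # hypothetical path on the network volume

t0 = time.time()
tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_DIR,
    torch_dtype=torch.float16,
    device_map="auto",  # spread across whatever GPUs the worker has
)
print(f"cold start: weights loaded in {time.time() - t0:.1f}s")


def handler(job):
    """Runs per request; the model above is reused for as long as the worker stays warm."""
    prompt = job["input"]["prompt"]
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=256)
    return {"text": tokenizer.decode(output[0], skip_special_tokens=True)}


runpod.serverless.start({"handler": handler})
```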
Am I correct that, until I either (a) have enough steady traffic to justify paying for “active workers” that stay warm, or (b) pay for beefy CI servers with the disk to build a 70 GB image, this problem won’t go away?
Is that really the trade-off? How are people actually serving large models?
Would love pointers, numbers, or examples—thanks!
9 Replies
The harsh truth is: they don't. With scale-to-zero, serverless looks like a great way to start things off on paper. In reality, serverless was never designed for workloads like huge LLMs. Companies like to market it that way nowadays, but technically it simply doesn't work.
Even the frameworks aren't designed for it. Take vLLM, for instance: it expects a long startup during which a pile of optimisations happen (torch.compile, CUDA graph capture, dummy-request caching) so that inference is as fast and consistent as possible afterwards. It's not meant to spin up and shut down over and over.
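To illustrate the trade-off: about the only knob vLLM gives you here is skipping some of that startup work, which shortens the cold start a bit but costs you the steady-state speed the engine was built for. A rough sketch (model name and GPU count are just examples):

```python
# Sketch: vLLM's enforce_eager skips CUDA graph capture, trading slower steady-state
# inference for a somewhat shorter startup. Model name and GPU count are examples only.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # example; any large model shows the same pattern
    tensor_parallel_size=2,                      # assumes a 2-GPU worker
    enforce_eager=True,                          # skip CUDA graph capture to shave startup time
)

outputs = llm.generate(
    ["Why are cold starts expensive for large models?"],
    SamplingParams(max_tokens=128),
)
print(outputs[0].outputs[0].text)
```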
Your best bet to get this close to working is to build that huge image locally, push it to Docker Hub, run it on RunPod, and keep your endpoint warm with periodic/API-data-driven pre-warm requests so responsiveness stays at usable levels (see the sketch after this list). Still, there are problems with this:
- If everyone does this trick, RunPod usability will get even worse than it already is
- You still pay for the huge engine + model initialisation time on every cold start, and it only gets worse as a busier platform shuffles your workers more often.
- It will still be both much slower and more expensive than the alternatives.
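A pre-warm loop can be as dumb as this sketch (endpoint ID, API key variable, and the 4-minute interval are placeholders; it just hits the standard /runsync serverless API):

```python
# Sketch of a periodic pre-warm request that keeps a serverless worker from scaling to zero.
# ENDPOINT_ID and the interval are placeholders; tune the interval to your idle timeout.
import os
import time

import requests

ENDPOINT_ID = "your-endpoint-id"        # hypothetical
API_KEY = os.environ["RUNPOD_API_KEY"]  # set in your environment
URL = f"https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync"

while True:
    try:
        resp = requests.post(
            URL,
            headers={"Authorization": f"Bearer {API_KEY}"},
            json={"input": {"prompt": "ping", "warmup": True}},  # your handler can short-circuit on "warmup"
            timeout=120,
        )
        print("pre-warm status:", resp.json().get("status"))
    except requests.RequestException as exc:
        print("pre-warm failed:", exc)
    time.sleep(4 * 60)  # stay under the endpoint's idle timeout
```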
What are the alternatives?
- Shared public model endpoints, like those from providers on OpenRouter (minimal call sketch after this list). If you find your favourite model there, or even a provider that lets you attach a custom runtime LoRA adapter to the model (so you can effectively fine-tune it), you've won. Downside? The provider can decide at any moment to stop hosting that model.
- Buying your own hardware and hosting it yourself. Sounds crazy? Well, the consumer-grade GPUs RunPod rents out cost roughly four months' worth of Pod rent to buy outright, so in the long term it can make more sense to take out a loan and buy a GPU yourself.
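To make the shared-endpoint route concrete: OpenRouter exposes an OpenAI-compatible API, so calling a hosted model is essentially just a base-URL change. A minimal sketch (the model slug is only an example):

```python
# Sketch of calling a shared public endpoint via OpenRouter's OpenAI-compatible API.
# The model slug is just an example; pick whichever provider/model combination you need.
import os

from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

response = client.chat.completions.create(
    model="meta-llama/llama-3.1-70b-instruct",  # example slug, not an endorsement
    messages=[{"role": "user", "content": "Summarise why cold starts hurt for 70B models."}],
)
print(response.choices[0].message.content)
```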
Not the OP, but this is the hard truth I needed to hear early on while building my video-generation service. Thank you for that. Could you tell me a little more about OpenRouter? I don't fully understand how it works yet. It doesn't directly serve video generators like LTX, Hunyuan, or Wan, but you can use Gemini 2.5 and other models that support video gen, is that correct?
I personally use OpenRouter mostly to keep track of reputable providers since, as you say, it only has text generation sorted out so far, while the providers themselves may already offer those other models. For example, DeepInfra has text, TTS, STT, embedding, ranking, image, and video models available as shared public endpoints billed per token/output. You still have to be careful to pick the right one, not only a provider that offers what you need but also one that's secure and stable. But if you find one, it's currently the best way to start. That's also why I think RunPod should shift from the serverless model towards shared endpoints entirely, which they've already started doing with Flux.
You dropped this, King: 👑 Thank you!
Oooh, that Flux endpoint is killer, that's 100% what I wish Runpod had for video gen. However, the slight complexity in set-up is the only thing stopping every wanna-be entrepreneur like me from starting their own video gen service and over-saturating the market with cheap video gen lol. Prolly for the best that it still takes a tiny bit of effort xD Thanks again!
Thank you for the clarification! I agree that with a heavy workload, having your own hardware is always a good idea.
I would also be happy to switch to on-demand servers, but scalability is still an issue. Most of the time, I host custom models (e.g., try-on or 3D estimation), which obviously aren’t available via shared endpoints.
Btw, serverless is still an option for me in certain cases—particularly when my time commitment is minimal and I can deliver results comfortably with a 5–10 minute delay. In those cases, waiting a bit longer is preferable to investing extra hours to build robust infrastructure to handle random bursts of traffic.
Make sure to check back every once in a while; we're working on expanding the models we offer as public endpoints!
We're also actively thinking about the broader problem here, so keep your eyes peeled for future updates.
Love to hear it! Will do!
Hi all,
I know this is an old thread, but you may find some happiness in this update.
We have added some new models:
- alibaba / wan t2v 720p
- alibaba / wan i2v 720p
- black-forest-labs / flux.1 kontext [dev]
There will be more added in the future.