How do people serve large models on Runpod Serverless?
Hi all, I'm looking for real-world advice on shipping a model with large weights (60 GB+) as a Runpod Serverless endpoint. I seem to be stuck between two awkward choices (rough handler sketch for both after the list):
1) Embed the weights in the Docker image
Pros: Once the image is cached on a worker, the cold start only covers loading the weights into GPU memory.
Cons: A ~70 GB image is painful to build: most CI runners don't have that much local disk, building an image that size usually takes hours, and Runpod support says very large images roll out to workers more slowly.
2) Keep the weights on a Runpod Network Volume
Pros: Tiny image, so CI/CD is easy.
Cons: Every cold start streams the full 60 GB from the volume; I'm seeing roughly two-minute cold starts, which cost more GPU time than the actual inference.
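For context, here is roughly what my handler looks like under both options; the only thing that changes is where the weights live. This is a minimal Transformers-based sketch rather than my exact stack: the model path, env var name, and generation settings are placeholders, and /runpod-volume is where Runpod mounts a network volume on serverless workers, as far as I understand.

```python
import os
import runpod
from transformers import AutoModelForCausalLM, AutoTokenizer

# Option 1: weights baked into the image (e.g. copied to /models/my-70b at build time).
# Option 2: weights on the network volume, which mounts at /runpod-volume on the worker.
MODEL_PATH = os.environ.get("MODEL_PATH", "/runpod-volume/my-70b")  # placeholder path

# Load once at module import so warm requests reuse it; on a cold start this is
# where the 60 GB read (from the image layers or the network volume) actually happens.
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForCausalLM.from_pretrained(MODEL_PATH, device_map="auto")

def handler(job):
    prompt = job["input"]["prompt"]
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=256)
    return {"text": tokenizer.decode(output[0], skip_special_tokens=True)}

runpod.serverless.start({"handler": handler})
```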
Am I correct that, until I either (a) have enough steady traffic to justify paying for “active workers” that stay warm, or (b) pay for beefy CI servers with the disk to build a 70 GB image, this problem won’t go away?
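For (a), this is the back-of-envelope I keep running; all numbers below are placeholders, so plug in your own GPU price and traffic:

```python
# Rough break-even check: serverless with frequent cold starts vs. one always-on active worker.
# Every number here is a hypothetical placeholder, not a measurement.
gpu_price_per_hour = 2.0    # $/hr for the GPU tier, placeholder
cold_start_s = 120          # ~2 min cold start streaming weights from the network volume
inference_s = 20            # per-request inference time, placeholder
requests_per_hour = 10      # traffic level, placeholder

# Worst case: every request hits a cold worker, so each one pays cold start + inference.
serverless_cost = requests_per_hour * (cold_start_s + inference_s) / 3600 * gpu_price_per_hour

# One active worker billed for the whole hour; the cold start is paid once and amortized away.
active_worker_cost = gpu_price_per_hour

print(f"all-cold serverless: ${serverless_cost:.2f}/hr vs active worker: ${active_worker_cost:.2f}/hr")
```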
Is that really the trade-off? How are people actually serving large models?
Would love pointers, numbers, or examples—thanks!