I’m currently running Llama-3.3 in production on RunPod Serverless using vLLM, with a worker that remains warm and handles continuous traffic successfully.
I’m now trying to upgrade this setup to Llama-4, and I’m looking for official guidance on how it should be configured on Serverless, rather than confirmation that it’s theoretically possible.
Specifically, I’m looking for help with:
Reference Docker images
Do you provide (or recommend) a RunPod-maintained Docker image for running Llama-4 with vLLM on Serverless?
If not, is there a reference image or example you recommend as a starting point?
Model loading strategy on Serverless
For production Serverless workloads, is the recommended approach to:
bake Llama-4 weights into the Docker image, or
download weights at startup and rely on the warm worker lifecycle?
Are there size or startup-time thresholds where one approach is preferred?
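To make the trade-off I’m weighing concrete, here is a rough sketch of the startup logic I have in mind (the function name and paths are hypothetical, purely to illustrate the two strategies):

```python
import os


def resolve_model_path(baked_dir: str, cache_dir: str) -> str:
    """Hypothetical startup helper: prefer weights baked into the image,
    otherwise fall back to a cache directory filled on first cold start."""
    if os.path.isdir(baked_dir) and os.listdir(baked_dir):
        # Strategy 1: weights shipped inside the Docker image
        # (larger image, but no download once the image is pulled).
        return baked_dir
    # Strategy 2: download at startup (smaller image, but each cold start
    # pays the download cost and depends on ephemeral disk limits).
    os.makedirs(cache_dir, exist_ok=True)
    return cache_dir
```

In strategy 2 the actual download (e.g. via huggingface_hub) would happen before the path is handed to vLLM; the sketch only shows the branch point I’d like official guidance on.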
Serverless-specific constraints
Are there known Serverless limits (ephemeral disk size, image size, startup timeout, worker recycling behavior) that differ from Pods and that we should explicitly account for when running larger models like Llama-4?
Production recommendations
Given a production use case with steady traffic and warm workers, is Serverless still a supported and recommended product for Llama-4, or do you advise migrating this workload to Pods?
I want to ensure I’m following the intended and supported setup for production rather than relying on behavior that might change.
Thanks for any concrete guidance or references you can share.
Best, Wilbur