I’m currently running Llama-3.3 in production on RunPod Serverless using vLLM, with a worker that remains warm and handles continuous traffic successfully.
I’m now trying to upgrade this setup to Llama-4, and I’m looking for official guidance on how it should be configured on Serverless, rather than confirmation that it’s theoretically possible.
Specifically, I’m looking for help with:
Reference Docker images
Do you provide (or recommend) a RunPod-maintained Docker image for running Llama-4 with vLLM on Serverless?
If not, is there a reference image or example you recommend as a starting point?
Model loading strategy on Serverless
For production Serverless workloads, is the recommended approach to:
bake Llama-4 weights into the Docker image, or
download weights at startup and rely on the warm worker lifecycle?
Are there size or startup-time thresholds where one approach is preferred?
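To make the trade-off I’m weighing concrete, here is a rough sketch of the startup logic I have in mind (the function name and paths are hypothetical, purely to illustrate the two strategies):

```python
import os


def resolve_model_path(baked_dir: str, cache_dir: str) -> str:
    """Hypothetical startup helper: prefer weights baked into the image,
    otherwise fall back to a cache directory filled on first cold start."""
    if os.path.isdir(baked_dir) and os.listdir(baked_dir):
        # Strategy 1: weights shipped inside the Docker image
        # (larger image, but no download once the image is pulled).
        return baked_dir
    # Strategy 2: download at startup (smaller image, but each cold start
    # pays the download cost and depends on ephemeral disk limits).
    os.makedirs(cache_dir, exist_ok=True)
    return cache_dir
```

In strategy 2 the actual download (e.g. via huggingface_hub) would happen before the path is handed to vLLM; the sketch only shows the branch point I’d like official guidance on.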
Serverless-specific constraints
Are there known Serverless limits (ephemeral disk size, image size, startup timeout, worker recycling behavior) that differ from Pods and that we should explicitly account for when running larger models like Llama-4?
Production recommendations
Given a production use case with steady traffic and warm workers, is Serverless still a supported and recommended product for Llama-4, or do you advise migrating this workload to Pods?
I want to ensure I’m following the intended and supported setup for production rather than relying on behavior that might change.
Thanks for any concrete guidance or references you can share.
Best, Wilbur