Serverless

Hi team 👋, I ran into an issue with unexpected billing (around $400) on my serverless vLLM endpoint while it was idle. Support explained it was caused by a CUDA 12.9 misconfiguration in my endpoint settings. They kindly applied a $100 credit 🙏, but I'd like to make sure I configure things correctly moving forward. Could you clarify:

1. Which CUDA version is recommended for running meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 on vLLM?
2. How do I ensure the worker truly scales down to zero when idle, so I don't continue to incur charges unnecessarily?

Appreciate your guidance 🚀
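Note for anyone landing here later: idle billing on serverless usually comes down to the worker counts rather than CUDA itself; as far as I know, any active/min workers keep billing while idle, so scale-to-zero generally means keeping active workers at 0 and letting the idle timeout retire flex workers. For the CUDA side, a quick sanity check from inside the worker shows which toolkit the installed PyTorch/vLLM wheel was built against versus what the host actually exposes. A minimal sketch, assuming a Python worker with PyTorch installed:

```python
# Quick sanity check run inside the serverless worker (assumes PyTorch is installed).
# Prints the CUDA toolkit the wheel was built against vs. what the host exposes,
# which is where a "CUDA 12.9" style mismatch would show up.
import torch

print("torch version:     ", torch.__version__)
print("built against CUDA:", torch.version.cuda)
print("CUDA available:    ", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:            ", torch.cuda.get_device_name(0))
    print("compute capability:", torch.cuda.get_device_capability(0))
```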
Abhishek (OP) · 4d ago
Hi @Jason 👋, thanks for jumping in earlier! I'm working on deploying meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 on serverless vLLM with RunPod. Here's my current config:

- GPU: 2 × H200 SXM (141 GB each)
- CUDA versions allowed: 12.1 – 12.7
- Device: cuda
- Idle Timeout: 5s (to scale to zero quickly)
- Execution Timeout: 600s
- FlashBoot: enabled
- Env vars:
  - MAX_MODEL_LEN=128000
  - PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
  - RAW_OPENAI_OUTPUT=1
  - OPENAI_SERVED_MODEL_NAME_OVERRIDE=meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8

But I'm still running into OOM issues:

ERROR 08-21 09:43:51 [multiproc_executor.py:511] torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 5.00 GiB. GPU 0 has a total capacity of 139.72 GiB of which 260.69 MiB is free. Process 668548 has 139.46 GiB memory in use. Of the allocated memory 138.04 GiB is allocated by PyTorch, and 8.63 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.

👉 Could you please confirm the optimal configuration for this FP8 model on H200 SXM?

1. Is 2× H200 actually required, or should it run fine on 1 GPU?
2. Is my MAX_MODEL_LEN=128000 too aggressive for FP8 mode?
3. Any best practices you'd recommend for memory allocation / env vars with this model on vLLM?

Would appreciate your guidance 🚀
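For what it's worth, a back-of-the-envelope check suggests this OOM is expected with the current setup: Maverick is a mixture-of-experts model, so all expert weights must sit in VRAM even though only ~17B parameters are active per token, and the model card lists roughly 400B total parameters. A rough sketch of the arithmetic (the parameter count, KV-cache size, and overhead fraction below are illustrative assumptions, not measured values):

```python
# Back-of-the-envelope VRAM estimate for an MoE checkpoint served with vLLM.
# All numbers below are assumptions for illustration -- check the model card
# for the real total parameter count (Maverick is roughly 400B total, 17B active).

def estimate_vram_gb(total_params_b: float, bytes_per_param: float,
                     kv_cache_gb: float, overhead_frac: float = 0.10) -> float:
    """Weights + KV cache + a fixed fraction for activations/CUDA graphs."""
    weights_gb = total_params_b * bytes_per_param   # ~1 byte per param for FP8
    return (weights_gb + kv_cache_gb) * (1 + overhead_frac)

total_params_b = 400                 # assumed total across all experts, in billions
needed = estimate_vram_gb(total_params_b, bytes_per_param=1.0, kv_cache_gb=40)
available = 2 * 141                  # 2 x H200 SXM
print(f"~{needed:.0f} GB needed vs {available} GB available -> fits: {needed <= available}")
```

Under those assumptions the FP8 weights alone (~400 GB) already exceed the ~282 GB available across 2× H200, so shrinking MAX_MODEL_LEN or tweaking PYTORCH_CUDA_ALLOC_CONF won't be enough on its own; more GPUs per worker (a higher --tensor-parallel-size in vLLM) is the first thing to look at before tuning context length or allocator settings.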
Dj · 4d ago
Also, about your refund: have you reached out to support yet? Not a problem either way, just making sure you're good.
