Serverless

Hi team 👋, I ran into an issue with unexpected billing (around $400) on my serverless vLLM endpoint while it was idle. Support explained it was caused by a CUDA 12.9 misconfiguration in my endpoint settings. They kindly applied a $100 credit 🙏, but I'd like to make sure I configure things correctly moving forward. Could you clarify:

1. Which CUDA version is recommended for running meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 on vLLM?
2. How do I ensure the worker truly scales down to zero when idle, so I don't continue to incur charges unnecessarily?

Appreciate your guidance 🚀
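Note for anyone landing here later: idle billing on serverless usually comes down to the worker counts rather than CUDA itself; as far as I know, any active/min workers keep billing while idle, so scale-to-zero generally means keeping active workers at 0 and letting the idle timeout retire flex workers. For the CUDA side, a quick sanity check from inside the worker shows which toolkit the installed PyTorch/vLLM wheel was built against versus what the host actually exposes. A minimal sketch, assuming a Python worker with PyTorch installed:

```python
# Quick sanity check run inside the serverless worker (assumes PyTorch is installed).
# Prints the CUDA toolkit the wheel was built against vs. what the host exposes,
# which is where a "CUDA 12.9" style mismatch would show up.
import torch

print("torch version:     ", torch.__version__)
print("built against CUDA:", torch.version.cuda)
print("CUDA available:    ", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:            ", torch.cuda.get_device_name(0))
    print("compute capability:", torch.cuda.get_device_capability(0))
```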
Abhishek (OP) · 4d ago
Hi @Jason 👋, thanks for jumping in earlier! I'm working on deploying meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 on serverless vLLM with RunPod. Here's my current config:

- GPU: 2 × H200 SXM (141 GB each)
- CUDA versions allowed: 12.1 – 12.7
- Device: cuda
- Idle Timeout: 5s (to scale to zero quickly)
- Execution Timeout: 600s
- FlashBoot: enabled
- Env vars:
  - MAX_MODEL_LEN=128000
  - PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
  - RAW_OPENAI_OUTPUT=1
  - OPENAI_SERVED_MODEL_NAME_OVERRIDE=meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8

But I'm still running into OOM issues:

ERROR 08-21 09:43:51 [multiproc_executor.py:511] torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 5.00 GiB. GPU 0 has a total capacity of 139.72 GiB of which 260.69 MiB is free. Process 668548 has 139.46 GiB memory in use. Of the allocated memory 138.04 GiB is allocated by PyTorch, and 8.63 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.

👉 Could you please confirm the optimal configuration for this FP8 model on H200 SXM?

1. Is 2× H200 actually required, or should it run fine on 1 GPU?
2. Is my MAX_MODEL_LEN=128000 too aggressive for FP8 mode?
3. Any best practices you'd recommend for memory allocation / env vars with this model on vLLM?

Would appreciate your guidance 🚀
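For what it's worth, a back-of-the-envelope check suggests this OOM is expected with the current setup: Maverick is a mixture-of-experts model, so all expert weights must sit in VRAM even though only ~17B parameters are active per token, and the model card lists roughly 400B total parameters. A rough sketch of the arithmetic (the parameter count, KV-cache size, and overhead fraction below are illustrative assumptions, not measured values):

```python
# Back-of-the-envelope VRAM estimate for an MoE checkpoint served with vLLM.
# All numbers below are assumptions for illustration -- check the model card
# for the real total parameter count (Maverick is roughly 400B total, 17B active).

def estimate_vram_gb(total_params_b: float, bytes_per_param: float,
                     kv_cache_gb: float, overhead_frac: float = 0.10) -> float:
    """Weights + KV cache + a fixed fraction for activations/CUDA graphs."""
    weights_gb = total_params_b * bytes_per_param   # ~1 byte per param for FP8
    return (weights_gb + kv_cache_gb) * (1 + overhead_frac)

total_params_b = 400                 # assumed total across all experts, in billions
needed = estimate_vram_gb(total_params_b, bytes_per_param=1.0, kv_cache_gb=40)
available = 2 * 141                  # 2 x H200 SXM
print(f"~{needed:.0f} GB needed vs {available} GB available -> fits: {needed <= available}")
```

Under those assumptions the FP8 weights alone (~400 GB) already exceed the ~282 GB available across 2× H200, so shrinking MAX_MODEL_LEN or tweaking PYTORCH_CUDA_ALLOC_CONF won't be enough on its own; more GPUs per worker (a higher --tensor-parallel-size in vLLM) is the first thing to look at before tuning context length or allocator settings.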
Dj · 4d ago
Also, about your refund: have you reached out to support yet? Not a problem either way, just making sure you're good.
