Serverless
Hi team,
I ran into an issue with unexpected billing (around $400) on my serverless vLLM endpoint while it was idle.
Support explained it was caused by a CUDA 12.9 misconfiguration in my endpoint settings. They kindly applied a $100 credit, but I'd like to make sure I configure things correctly moving forward.
Could you clarify:
1. Which CUDA version is recommended for running meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 on vLLM?
2. How do I ensure the pod truly scales down to zero when idle, so I don't continue to incur charges unnecessarily? (See the check I'm planning to run, sketched below.)
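Once I know the right settings, I plan to verify the scale-down behaviour with a quick check like the one below. I'm guessing at the serverless health route and the shape of its response here, and the endpoint ID and API key are placeholders, so please correct me if this isn't the right way to confirm that workers have actually gone to zero:

```python
import os
import requests

# Placeholders -- I'd substitute my real endpoint ID and API key.
ENDPOINT_ID = "ENDPOINT_ID"
API_KEY = os.environ.get("API_KEY", "YOUR_API_KEY")

# My assumption: the serverless health route reports current worker/job counts.
url = f"https://api.runpod.ai/v2/{ENDPOINT_ID}/health"

resp = requests.get(
    url,
    headers={"Authorization": f"Bearer {API_KEY}"},
    timeout=30,
)
resp.raise_for_status()

# If the worker counts drop to 0 while the endpoint sits idle, I'd read that
# as "scaled to zero" and expect no further compute charges to accrue.
print(resp.json())
```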
I'd appreciate your guidance.
