Serverless
Hi team,
I ran into an issue with unexpected billing (around $400) on my serverless vLLM endpoint while it was idle.
Support explained it was caused by a CUDA 12.9 misconfiguration in my endpoint settings. They kindly applied a $100 credit, but I'd like to make sure I configure things correctly moving forward.
Could you clarify:
1. Which CUDA version is recommended for running meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 on vLLM?
2. How do I ensure the pod truly scales down to zero when idle, so I don't continue to incur charges unnecessarily? (See the check I'm planning to run, sketched below.)
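Once I know the right settings, I plan to verify the scale-down behaviour with a quick check like the one below. I'm guessing at the serverless health route and the shape of its response here, and the endpoint ID and API key are placeholders, so please correct me if this isn't the right way to confirm that workers have actually gone to zero:

```python
import os
import requests

# Placeholders -- I'd substitute my real endpoint ID and API key.
ENDPOINT_ID = "ENDPOINT_ID"
API_KEY = os.environ.get("API_KEY", "YOUR_API_KEY")

# My assumption: the serverless health route reports current worker/job counts.
url = f"https://api.runpod.ai/v2/{ENDPOINT_ID}/health"

resp = requests.get(
    url,
    headers={"Authorization": f"Bearer {API_KEY}"},
    timeout=30,
)
resp.raise_for_status()

# If the worker counts drop to 0 while the endpoint sits idle, I'd read that
# as "scaled to zero" and expect no further compute charges to accrue.
print(resp.json())
```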
I'd appreciate your guidance.
