Abhishek · 5mo ago
Serverless

Hi team 👋,

I ran into an issue with unexpected billing (around $400) on my serverless vLLM endpoint while it was idle.
Support explained it was caused by a CUDA 12.9 misconfiguration in my endpoint settings. They kindly applied a $100 credit 🙏, but I’d like to make sure I configure things correctly moving forward.

Could you clarify:

Which CUDA version is recommended for running meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 on vLLM?
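For context, here is the sanity check I've been running at worker start to see which CUDA build and GPU the container actually gets (this assumes a standard PyTorch-based vLLM image; my understanding is that FP8 checkpoints generally need an Ada- or Hopper-class GPU, but please correct me if that's wrong):

```python
# Quick check of the CUDA build and GPU the serverless worker actually received.
# Assumes a PyTorch-based image (e.g. the vLLM worker); nothing Runpod-specific here.
import torch

print("Torch CUDA build:", torch.version.cuda)      # e.g. "12.4"
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
    major, minor = torch.cuda.get_device_capability(0)
    # My understanding: hardware FP8 kernels need compute capability >= 8.9
    # (Ada, e.g. L40S) or 9.0 (Hopper, e.g. H100).
    print(f"Compute capability: {major}.{minor}")
```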

How can I make sure the endpoint truly scales down to zero workers when idle, so I don't keep incurring charges unnecessarily? I've sketched how I'm currently checking this below.
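Right now I keep min/active workers at 0 with a short idle timeout in the endpoint settings, then poll the endpoint's health route to confirm the worker counts actually drop. A rough sketch of that check (the endpoint ID is a placeholder, and I'm assuming the /health route still reports worker counts the way it did when I set this up):

```python
# Spot-check that a serverless endpoint has scaled to zero:
# poll the /health route and inspect the reported worker counts.
import os
import requests

ENDPOINT_ID = "your-endpoint-id"  # placeholder, not my real endpoint

resp = requests.get(
    f"https://api.runpod.ai/v2/{ENDPOINT_ID}/health",
    headers={"Authorization": f"Bearer {os.environ['RUNPOD_API_KEY']}"},
    timeout=10,
)
resp.raise_for_status()

workers = resp.json().get("workers", {})
print(workers)
# When fully scaled down I'd expect every count (idle, ready, running, ...)
# to be 0 -- is that the right way to verify it?
```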

Appreciate your guidance 🚀
(Billing screenshot attached: Screenshot_from_2025-08-21_13-55-03.png)