Not possible to set temperature / top_p using Serverless vLLM via quick deploy?
By default, vLLM loads sampling parameters (e.g. temperature / top_p) from a model's generation_config.json if present (see https://github.com/vllm-project/vllm/issues/15241). To override this, you have to pass --generation-config when starting the vLLM server.
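For example, a server start along these lines (a sketch; the model id is a placeholder) tells vLLM to ignore the model's generation_config.json and fall back to its own built-in sampling defaults:

```shell
# Placeholder model id. Passing --generation-config vllm makes the server
# skip the model's generation_config.json and use vLLM's own defaults.
vllm serve meta-llama/Llama-3.1-8B-Instruct --generation-config vllm
```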
Because RunPod's worker-vllm (https://github.com/runpod-workers/worker-vllm) doesn't expose an environment variable to pipe through a --generation-config value, does this mean it's not possible to change the temperature or top_p for any model deployed via Serverless vLLM quick deploy that has a generation_config.json file, e.g. all the Meta Llama models? And is the solution a custom Docker image / worker?
generation_config="vllm" is required when using the OpenAI client too. I think the official template should take inspiration from this as well, by allowing the user to pass any vLLM engine argument / environment variable.
Ellroy, if you want something custom, check this one. We allow exactly that, even dynamically per cold-start request.

vLLM ignores the input from the OpenAI client unless the vLLM server is started with --generation-config vllm
Thanks @3WaD - will check it out 🙂

For anyone seeing this - I misunderstood how this works. Defaults are loaded from generation_config.json if present, but values passed at runtime still take precedence, as you would expect - https://chatgpt.com/share/68516a00-0b28-8000-b2b6-a51b890e4be0
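To illustrate that resolution with a minimal sketch (the model id is a placeholder, and this assumes an OpenAI-compatible vLLM endpoint): sampling parameters sent with each request override whatever defaults vLLM loaded from generation_config.json, so quick-deploy users can still set them per call.

```python
# Sketch: per-request sampling parameters take precedence over the defaults
# vLLM reads from generation_config.json. The model id below is a
# hypothetical placeholder.
payload = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",  # placeholder model id
    "messages": [{"role": "user", "content": "Say hello."}],
    "temperature": 0.2,  # overrides the generation_config.json default
    "top_p": 0.9,        # likewise overrides the default top_p
}

# This payload would be POSTed to the server's OpenAI-compatible endpoint,
# e.g. requests.post(f"{base_url}/v1/chat/completions", json=payload)
```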