Not possible to set temperature / top_p using Serverless vLLM via quick deploy?
By default, vLLM loads sampling parameters (e.g. temperature / top_p) from a model's generation_config.json if present (see https://github.com/vllm-project/vllm/issues/15241). To override this, you have to pass --generation-config when starting the vLLM server.
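For example, a server start along these lines (a sketch; the model id is a placeholder) tells vLLM to ignore the model's generation_config.json and fall back to its own built-in sampling defaults:

```shell
# Placeholder model id. Passing --generation-config vllm makes the server
# skip the model's generation_config.json and use vLLM's own defaults.
vllm serve meta-llama/Llama-3.1-8B-Instruct --generation-config vllm
```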
Because RunPod's worker-vllm (https://github.com/runpod-workers/worker-vllm) doesn't expose an environment variable to pipe through a --generation-config value, does this mean it's not possible to change the temperature or top_p for any model deployed via Serverless vLLM quick deploy that has a generation_config.json file, e.g. all the Meta Llama models? And is the solution a custom Docker image / worker?
generation_config="vllm" is required when using the OpenAI client too. I think the official template should take inspiration from this as well, by allowing the user to pass any vLLM engine argument / environment variable.
Ellroy, if you want something custom, check this one. We allow exactly that, even dynamically per cold-start request.

vLLM ignores the input from the OpenAI client unless the vLLM server is started with --generation-config vllm
Thanks @3WaD - will check it out 🙂

For anyone seeing this - I misunderstood how this works. Defaults are loaded from generation_config.json if present, but values passed at runtime still take precedence, as you would expect - https://chatgpt.com/share/68516a00-0b28-8000-b2b6-a51b890e4be0
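To illustrate that resolution with a minimal sketch (the model id is a placeholder, and this assumes an OpenAI-compatible vLLM endpoint): sampling parameters sent with each request override whatever defaults vLLM loaded from generation_config.json, so quick-deploy users can still set them per call.

```python
# Sketch: per-request sampling parameters take precedence over the defaults
# vLLM reads from generation_config.json. The model id below is a
# hypothetical placeholder.
payload = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",  # placeholder model id
    "messages": [{"role": "user", "content": "Say hello."}],
    "temperature": 0.2,  # overrides the generation_config.json default
    "top_p": 0.9,        # likewise overrides the default top_p
}

# This payload would be POSTed to the server's OpenAI-compatible endpoint,
# e.g. requests.post(f"{base_url}/v1/chat/completions", json=payload)
```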