vLLM serverless output cutoff
I deployed a serverless vLLM endpoint using deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B.
But when I make a request, the output is only 16 tokens (tested many times). I didn't change anything from the default settings except max_model_length, which I set to 32768.
How can I fix that? Or did I miss some config?

2 Replies
Maybe you have to set the max token output in your request,
or just use an OpenAI client to connect to RunPod (like in the docs).
The default is 16 tokens, I guess, if you don't set anything. Something like the sketch below.
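Here's a rough sketch of the OpenAI-client approach, assuming the OpenAI-compatible route from the RunPod docs; YOUR_ENDPOINT_ID and RUNPOD_API_KEY are placeholders you'd replace with your own values:

```python
# Sketch: call a RunPod serverless vLLM endpoint via the OpenAI client,
# explicitly setting max_tokens so generation isn't cut at the small default.
from openai import OpenAI

client = OpenAI(
    # Assumed base URL pattern for RunPod's OpenAI-compatible serverless route
    base_url="https://api.runpod.ai/v2/YOUR_ENDPOINT_ID/openai/v1",
    api_key="RUNPOD_API_KEY",  # placeholder
)

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
    messages=[{"role": "user", "content": "Explain what a vLLM worker is."}],
    max_tokens=1024,  # without this, output can stop after very few tokens
)
print(response.choices[0].message.content)
```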
I was running into the same issue. There's a sampling_params key you can pass in the request; I found the solution here: https://docs.runpod.io/serverless/workers/vllm/get-started#sample-api-requests
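For the raw endpoint route, a request body along these lines should work (a sketch based on the sample requests in the linked docs; ENDPOINT_ID and API_KEY are placeholders):

```python
# Sketch: send a request to the vLLM worker's /run route with sampling_params,
# which is where max_tokens goes for the non-OpenAI request format.
import requests

payload = {
    "input": {
        "prompt": "Explain what a vLLM worker is.",
        "sampling_params": {
            # vLLM's SamplingParams default for max_tokens is 16,
            # which is why the output was being cut off.
            "max_tokens": 1024,
            "temperature": 0.7,
        },
    }
}

resp = requests.post(
    "https://api.runpod.ai/v2/ENDPOINT_ID/run",   # placeholder endpoint ID
    headers={"Authorization": "Bearer API_KEY"},  # placeholder API key
    json=payload,
    timeout=60,
)
print(resp.json())
```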