vLLM serverless output cutoff
I deployed a serverless vLLM endpoint using deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B.
But when I make a request, the output is only 16 tokens long (tested many times). I didn't change anything from the default settings except max_model_length, which I set to 32768.
How can I fix that? Or did I miss a config?

2 Replies
I was running into the same issue. The request payload has a sampling_params key, and the generation length has to be set there (see the sketch below). I found the solution here: https://docs.runpod.io/serverless/workers/vllm/get-started#sample-api-requests
Get started | RunPod Documentation
Deploy a Serverless Endpoint for large language models (LLMs) with RunPod, a simple and efficient way to run vLLM Workers with minimal configuration.
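The 16-token cutoff comes from vLLM's default max_tokens of 16, which applies whenever the request doesn't pass its own sampling parameters; max_model_length only raises the context window, not the generation length. Below is a minimal sketch of a request against a RunPod serverless vLLM endpoint that sets sampling_params.max_tokens, based on the request format in the linked docs. The endpoint ID, API key, prompt, and the specific sampling values are placeholders you'd replace with your own.

```python
import requests

# Placeholders -- substitute your own endpoint ID and RunPod API key.
ENDPOINT_ID = "your-endpoint-id"
API_KEY = "your-runpod-api-key"

url = f"https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync"
headers = {
    "Authorization": f"Bearer {API_KEY}",
    "Content-Type": "application/json",
}

payload = {
    "input": {
        "prompt": "Explain the difference between a process and a thread.",
        # Without sampling_params, vLLM falls back to its default of 16 generated tokens.
        "sampling_params": {
            "max_tokens": 1024,
            "temperature": 0.7,
        },
    }
}

response = requests.post(url, headers=headers, json=payload, timeout=600)
print(response.json())
```

With max_tokens set explicitly, the model generates up to that many tokens instead of stopping at 16.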