vLLM serverless output cutoff
I deployed a serverless vLLM endpoint using deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B.
But when I make a request, the output is only 16 tokens (tested many times). I didn't change anything from the default settings except max_model_length, which I set to 32768.
How can I fix that? Or did I miss some config?

2 Replies
Maybe you have to set the max token output in your request,
or just use an OpenAI client to connect to RunPod (like in the docs).
The default is 16 tokens, I guess, if you don't set anything. Something like the sketch below.
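Here's a rough sketch of the OpenAI-client approach, assuming the OpenAI-compatible route from the RunPod docs; YOUR_ENDPOINT_ID and RUNPOD_API_KEY are placeholders you'd replace with your own values:

```python
# Sketch: call a RunPod serverless vLLM endpoint via the OpenAI client,
# explicitly setting max_tokens so generation isn't cut at the small default.
from openai import OpenAI

client = OpenAI(
    # Assumed base URL pattern for RunPod's OpenAI-compatible serverless route
    base_url="https://api.runpod.ai/v2/YOUR_ENDPOINT_ID/openai/v1",
    api_key="RUNPOD_API_KEY",  # placeholder
)

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
    messages=[{"role": "user", "content": "Explain what a vLLM worker is."}],
    max_tokens=1024,  # without this, output can stop after very few tokens
)
print(response.choices[0].message.content)
```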
I was running into the same issue. There's a sampling_params key you can pass in the request; I found the solution here: https://docs.runpod.io/serverless/workers/vllm/get-started#sample-api-requests
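For the raw endpoint route, a request body along these lines should work (a sketch based on the sample requests in the linked docs; ENDPOINT_ID and API_KEY are placeholders):

```python
# Sketch: send a request to the vLLM worker's /run route with sampling_params,
# which is where max_tokens goes for the non-OpenAI request format.
import requests

payload = {
    "input": {
        "prompt": "Explain what a vLLM worker is.",
        "sampling_params": {
            # vLLM's SamplingParams default for max_tokens is 16,
            # which is why the output was being cut off.
            "max_tokens": 1024,
            "temperature": 0.7,
        },
    }
}

resp = requests.post(
    "https://api.runpod.ai/v2/ENDPOINT_ID/run",   # placeholder endpoint ID
    headers={"Authorization": "Bearer API_KEY"},  # placeholder API key
    json=payload,
    timeout=60,
)
print(resp.json())
```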