How to set a max output token limit
Hi, I deployed a fine-tuned Llama 3 via vLLM serverless on RunPod. However, I'm getting a limited number of output tokens every time. Does anyone know if we can alter the max output tokens while sending the input prompt JSON?
7 Replies
vLLM does not yet support Llama 3.1
GitHub - runpod-workers/worker-vllm: The RunPod worker template for serving our large language model endpoints. Powered by vLLM.
I'm not using llama 3.1, it's the old llama 3
Are you asking how to set the Max Model Length parameter inside the vLLM worker? It is under LLM Settings.

No, that's more relevant to the context length, right? I'm talking about the output tokens.
This should do the job, let me try it.
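For reference, a minimal sketch of the request, assuming the endpoint uses the standard runpod-workers/worker-vllm input schema: max_tokens inside sampling_params is forwarded to vLLM's SamplingParams and caps the number of generated output tokens, separate from Max Model Length, which bounds the total context. The endpoint ID, prompt, and RUNPOD_API_KEY environment variable below are placeholders.

```python
import os
import requests

# Placeholder values; substitute your own RunPod endpoint ID and API key.
ENDPOINT_ID = "your_endpoint_id"
API_KEY = os.environ["RUNPOD_API_KEY"]

payload = {
    "input": {
        "prompt": "Explain vLLM in one paragraph.",
        # sampling_params is passed through to vLLM's SamplingParams;
        # max_tokens limits how many output tokens are generated.
        "sampling_params": {
            "max_tokens": 1024,
            "temperature": 0.7,
        },
    }
}

resp = requests.post(
    f"https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,
    timeout=300,
)
resp.raise_for_status()
print(resp.json())
```

If the output is still truncated after raising max_tokens, check that the prompt plus requested output fits within the worker's Max Model Length, since the context window bounds both together.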