How to set a max output token limit
Hi, I deployed a fine-tuned Llama 3 via vLLM serverless on RunPod. However, I'm getting a limited number of output tokens every time. Does anyone know if we can alter the max output tokens while sending the input prompt JSON?
7 Replies
vLLM does not yet support Llama 3.1
GitHub - runpod-workers/worker-vllm: The RunPod worker template for serving our large language model endpoints. Powered by vLLM.
I'm not using llama 3.1, it's the old llama 3
Are you asking how to set the Max Model Length parameter inside the vLLM worker? It is under LLM Settings.

No, that's more relevant to the context length, right? I'm talking about the output tokens.
This should do the job, let me try it.
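For reference, a minimal sketch of the request, assuming the endpoint uses the standard runpod-workers/worker-vllm input schema: max_tokens inside sampling_params is forwarded to vLLM's SamplingParams and caps the number of generated output tokens, separate from Max Model Length, which bounds the total context. The endpoint ID, prompt, and RUNPOD_API_KEY environment variable below are placeholders.

```python
import os
import requests

# Placeholder values; substitute your own RunPod endpoint ID and API key.
ENDPOINT_ID = "your_endpoint_id"
API_KEY = os.environ["RUNPOD_API_KEY"]

payload = {
    "input": {
        "prompt": "Explain vLLM in one paragraph.",
        # sampling_params is passed through to vLLM's SamplingParams;
        # max_tokens limits how many output tokens are generated.
        "sampling_params": {
            "max_tokens": 1024,
            "temperature": 0.7,
        },
    }
}

resp = requests.post(
    f"https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,
    timeout=300,
)
resp.raise_for_status()
print(resp.json())
```

If the output is still truncated after raising max_tokens, check that the prompt plus requested output fits within the worker's Max Model Length, since the context window bounds both together.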