Runpod•12mo ago•

19 replies

Model Maximum Context Length Error

Hi there, I run an AI chat site (https://www.hammerai.com). I was previously using vLLM serverless, but switched over to using dedicated Pods with the vLLM template (

Container Image: vllm/vllm-openai:latest

Container Image: vllm/vllm-openai:latest

Container Image: vllm/vllm-openai:latest

Container Image: vllm/vllm-openai:latest. Here is my configuration:

--host 0.0.0.0 --port 8000 --model LoneStriker/Fimbulvetr-11B-v2-AWQ --enforce-eager --gpu-memory-utilization 0.95 --api-key foo --max-model-len 4096 --max-seq-len-to-capture 4096 --trust-remote-code --chat-template "{{ (messages|selectattr('role', 'equalto', 'system')|list|last).content|trim if (messages|selectattr('role', 'equalto', 'system')|list) else '' }}  {% for message in messages %} {% if message['role'] == 'user' %} ### Instruction: {{ message['content']|trim -}} {% if not loop.last %}   {% endif %} {% elif message['role'] == 'assistant' %} ### Response: {{ message['content']|trim -}} {% if not loop.last %}   {% endif %} {% elif message['role'] == 'user_context' %} ### Input: {{ message['content']|trim -}} {% if not loop.last %}   {% endif %} {% endif %} {% endfor %} {% if add_generation_prompt and messages[-1]['role'] != 'assistant' %} ### Response: {% endif %}"

--host 0.0.0.0 --port 8000 --model LoneStriker/Fimbulvetr-11B-v2-AWQ --enforce-eager --gpu-memory-utilization 0.95 --api-key foo --max-model-len 4096 --max-seq-len-to-capture 4096 --trust-remote-code --chat-template "{{ (messages|selectattr('role', 'equalto', 'system')|list|last).content|trim if (messages|selectattr('role', 'equalto', 'system')|list) else '' }}  {% for message in messages %} {% if message['role'] == 'user' %} ### Instruction: {{ message['content']|trim -}} {% if not loop.last %}   {% endif %} {% elif message['role'] == 'assistant' %} ### Response: {{ message['content']|trim -}} {% if not loop.last %}   {% endif %} {% elif message['role'] == 'user_context' %} ### Input: {{ message['content']|trim -}} {% if not loop.last %}   {% endif %} {% endif %} {% endfor %} {% if add_generation_prompt and messages[-1]['role'] != 'assistant' %} ### Response: {% endif %}"

--host 0.0.0.0 --port 8000 --model LoneStriker/Fimbulvetr-11B-v2-AWQ --enforce-eager --gpu-memory-utilization 0.95 --api-key foo --max-model-len 4096 --max-seq-len-to-capture 4096 --trust-remote-code --chat-template "{{ (messages|selectattr('role', 'equalto', 'system')|list|last).content|trim if (messages|selectattr('role', 'equalto', 'system')|list) else '' }}  {% for message in messages %} {% if message['role'] == 'user' %} ### Instruction: {{ message['content']|trim -}} {% if not loop.last %}   {% endif %} {% elif message['role'] == 'assistant' %} ### Response: {{ message['content']|trim -}} {% if not loop.last %}   {% endif %} {% elif message['role'] == 'user_context' %} ### Input: {{ message['content']|trim -}} {% if not loop.last %}   {% endif %} {% endif %} {% endfor %} {% if add_generation_prompt and messages[-1]['role'] != 'assistant' %} ### Response: {% endif %}"

--host 0.0.0.0 --port 8000 --model LoneStriker/Fimbulvetr-11B-v2-AWQ --enforce-eager --gpu-memory-utilization 0.95 --api-key foo --max-model-len 4096 --max-seq-len-to-capture 4096 --trust-remote-code --chat-template "{{ (messages|selectattr('role', 'equalto', 'system')|list|last).content|trim if (messages|selectattr('role', 'equalto', 'system')|list) else '' }}  {% for message in messages %} {% if message['role'] == 'user' %} ### Instruction: {{ message['content']|trim -}} {% if not loop.last %}   {% endif %} {% elif message['role'] == 'assistant' %} ### Response: {{ message['content']|trim -}} {% if not loop.last %}   {% endif %} {% elif message['role'] == 'user_context' %} ### Input: {{ message['content']|trim -}} {% if not loop.last %}   {% endif %} {% endif %} {% endfor %} {% if add_generation_prompt and messages[-1]['role'] != 'assistant' %} ### Response: {% endif %}"

I then call it with:

import {convertToCoreMessages, streamText} from 'ai' // the vercel ai sdk

export async function POST(req: NextRequest): Promise<Response> {
    ...
    // Depending on whether it is a chat or a completion, send `messages` or `prompt`:
    const response = await streamText({
      ...(generateChat
        ? {messages: convertToCoreMessages(generateChat.messages)}
        : {prompt: generateCompletion?.prompt}),

import {convertToCoreMessages, streamText} from 'ai' // the vercel ai sdk

export async function POST(req: NextRequest): Promise<Response> {
    ...
    // Depending on whether it is a chat or a completion, send `messages` or `prompt`:
    const response = await streamText({
      ...(generateChat
        ? {messages: convertToCoreMessages(generateChat.messages)}
        : {prompt: generateCompletion?.prompt}),

import {convertToCoreMessages, streamText} from 'ai' // the vercel ai sdk

export async function POST(req: NextRequest): Promise<Response> {
    ...
    // Depending on whether it is a chat or a completion, send `messages` or `prompt`:
    const response = await streamText({
      ...(generateChat
        ? {messages: convertToCoreMessages(generateChat.messages)}
        : {prompt: generateCompletion?.prompt}),

import {convertToCoreMessages, streamText} from 'ai' // the vercel ai sdk

export async function POST(req: NextRequest): Promise<Response> {
    ...
    // Depending on whether it is a chat or a completion, send `messages` or `prompt`:
    const response = await streamText({
      ...(generateChat
        ? {messages: convertToCoreMessages(generateChat.messages)}
        : {prompt: generateCompletion?.prompt}),

jOP•1/22/25, 9:10 PM

But am now running into a new error:

responseBody: `{"object":"error","message":"This model's maximum context length is 4096 tokens. However, you requested 4133 tokens (3877 in the messages, 256 in the completion). Please reduce the length of the messages or completion.","type":"BadRequestError","param":null,"code":400}`,

responseBody: `{"object":"error","message":"This model's maximum context length is 4096 tokens. However, you requested 4133 tokens (3877 in the messages, 256 in the completion). Please reduce the length of the messages or completion.","type":"BadRequestError","param":null,"code":400}`,

I didn't see this when using the serverless endpoints. So my question:
- Is there something I can be setting on vLLM to automatically manage the context length for me? I.e. to delete tokens from the promptprompt or

messages

messages

automatically for me? Or do I need to manage this myself?

Thanks!

Jason•1/23/25, 3:26 AM

--max-model-len 4096 --max-seq-len-to-capture 4096
i guess its those two arguments ( your vllm start arguments)

jOP•1/23/25, 3:54 AM

Yep, but won't it just default to something else even if I don't set those? And then we'll run into the same issue at whatever number of tokens that is?

Jason•1/23/25, 4:02 AM

Yes, you got to set it higher

Jason•1/23/25, 4:02 AM

Yep, set to a number that will be your max length just estimate it or add more than your estimate

Jason•1/23/25, 4:03 AM

Bigger context length requires more vram btw

jOP•1/23/25, 5:00 AM

Yes, but when I do that, specifically setting to 8192, I get a separate error saying that I have exceeded the maximum context length. But in general, even if I manage to set it a little higher, won't I run into the same problem then?

Jason•1/23/25, 5:12 AM

oh..

Jason•1/23/25, 5:13 AM

then its the model that you used

Jason•1/23/25, 5:13 AM

has limits on context length

Jj Yes, but when I do that, specifically setting to 8192, I get a separate error sa...

Jason•1/23/25, 5:15 AM

can i see the error too?

Jason•1/23/25, 5:15 AM

maybe copy and paste the log

jOP•1/23/25, 5:17 AM

Unfortunately I didn't save it and Runpod logs don't go back that far - but I guess doesn't it not really matter as long as we have to set a max limit? Because in a chat application we'll eventually go past it.

Jj Unfortunately I didn't save it and Runpod logs don't go back that far - but I gu...

Jason•1/23/25, 5:29 AM

yeah use another model that can handle more context length

Jason•1/23/25, 5:29 AM

i think its set from the model

jOP•1/23/25, 5:49 AM

Got it - so vLLM doesn't help with truncating things? I just asked b/c coming from Ollama it will automatically update your prompt so that it continues to work even past the max context length.

Jj Got it - so vLLM doesn't help with truncating things? I just asked b/c coming fr...

Jason•1/23/25, 5:50 AM

hmm not sure, i think not

jOP•1/23/25, 11:00 PM

Got it. So do you know other AI chat sites handle this? Does everyone just write custom code if they're using Runpod vLLM?

Jj Got it. So do you know other AI chat sites handle this? Does everyone just write...

Jason•1/24/25, 3:26 AM

hmm, there might be some libraries / frameworks out there that helps with this but, i think other people dont use models with shorter context length unless thats not really important

Model Maximum Context Length Error

Similar Threads

Similar Threads

Similar Threads