RunPod4mo ago
justin

vLLM Memory Error / RunPod Error?

https://pastebin.com/vjSgS4up
Error initializing vLLM engine: The model's max seq len (32768) is larger than the maximum number of tokens that can be stored in KV cache (24144). Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine.
I got this error when I tried to start my vLLM Mistral serverless endpoint. It ended up fixing itself after I switched to a 24GB GPU Pro, which made me guess the GPU just wasn't big enough (even though it was the CPU showing 100% usage). My real question is: how do I stop it from erroring out and retrying infinitely if this happens again? Is there a way to catch this, either through RunPod or vLLM? (The pastebin shows it eventually worked, because that log is from my second request after I upgraded the GPU, but before that it just kept looping until I manually killed it.)
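For context, something like this is what I'm imagining on the catching side (just a rough sketch of wrapping engine startup so it fails fast instead of crash-looping, not the actual worker code; the model and values are only illustrative):
```python
import sys

from vllm import AsyncEngineArgs, AsyncLLMEngine

# Rough sketch: fail fast on a bad config instead of retrying forever.
# vLLM raises ValueError when max_model_len doesn't fit in the KV cache.
try:
    engine = AsyncLLMEngine.from_engine_args(
        AsyncEngineArgs(
            model="mistralai/Mistral-7B-v0.1",
            max_model_len=16384,          # illustrative: under the 24144-token budget from the error
            gpu_memory_utilization=0.95,  # or raise this instead
        )
    )
except ValueError as err:
    print(f"Engine init failed, giving up instead of retrying: {err}", file=sys.stderr)
    sys.exit(1)
```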
4 Replies
justin
justin4mo ago
@Alpay Ariyak ;D not sure if you know, or if this is more of a generic RunPod thing. Also wondering, if you happen to know: is Mistral just really dumb ;D or is something weird going on?
depot build -t justinwlin/mistral7b:1.0 --build-arg MODEL_NAME="mistralai/Mistral-7B-v0.1" --build-arg BASE_PATH="/models" . --platform linux/amd64 --project 2x6lg48dzf --push
I asked it things like "Hello world", "Tell me a funny joke", etc.:
{input: "Tell me a funny joke"}
And it responds very weirdly? It always seems to begin with:
"\n\nDoes anyone else like this?"
{
  "delayTime": 18319,
  "executionTime": 1321,
  "id": "56e22e71-9653-4eef-bcb8-b9e7e04a372d-u1",
  "output": [
    {
      "choices": [
        {
          "tokens": [
            "\n\nDoes anyone else like this?\n\nThe one that starts “What"
          ]
        }
      ],
      "usage": {
        "input": 7,
        "output": 16
      }
    }
  ],
  "status": "COMPLETED"
}
{
  "delayTime": 18346,
  "executionTime": 1141,
  "id": "7868b449-7332-40ce-bf6e-2849cb3a5253-u1",
  "output": [
    {
      "choices": [
        {
          "tokens": [
            "\n\nDoes anyone else like this?\n\nhi, i'm nice"
          ]
        }
      ],
      "usage": {
        "input": 6,
        "output": 16
      }
    }
  ],
  "status": "COMPLETED"
}
Alpay Ariyak
Alpay Ariyak4mo ago
Hey, so mistralai/Mistral-7B-v0.1 is a completion/base model: rather than being something you can chat with, its purpose is to complete the text you give it.
Alpay Ariyak
Alpay Ariyak4mo ago
You have 2 options:
1. Use a chat/instruct model, such as mistralai/Mistral-7B-Instruct-v0.1 - this is the best option.
2. Set a chat template using the CUSTOM_CHAT_TEMPLATE env variable. You can find Jinja chat templates in the tokenizer_config.json files of chat/instruct models. E.g. here's Mistral Instruct's chat template: https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1/blob/9ab9e76e2b09f9f29ea2d56aa5bd139e4445c59e/tokenizer_config.json#L32. If you really wanted to use base Mistral instead of Instruct, you would copy that template and set it as the CUSTOM_CHAT_TEMPLATE var.
But you will get the best performance out of the first option.
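For illustration, this is roughly what a chat template does to your input before it hits the model (a sketch using the transformers tokenizer; the exact prompt string is what matters, not this code):
```python
from transformers import AutoTokenizer

# Illustrative: apply the Instruct model's chat template to a user message.
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")
messages = [{"role": "user", "content": "Tell me a funny joke"}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)  # roughly: <s>[INST] Tell me a funny joke [/INST]
```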
Alpay Ariyak
Alpay Ariyak4mo ago
In terms of this issue:
Error initializing vLLM engine: The model's max seq len (32768) is larger than the maximum number of tokens that can be stored in KV cache (24144). Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine.
Try setting the MAX_MODEL_LENGTH env var to a number under 24144 that will still be enough for your use case.
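Roughly speaking, that env var just ends up as vLLM's max_model_len engine argument, something like this (a simplified sketch, not the worker's exact code; 16384 is only an example value):
```python
import os

from vllm import AsyncEngineArgs, AsyncLLMEngine

# Simplified sketch: cap the context window so the KV cache fits on the GPU.
engine_args = AsyncEngineArgs(
    model="mistralai/Mistral-7B-v0.1",
    max_model_len=int(os.environ.get("MAX_MODEL_LENGTH", "16384")),
)
engine = AsyncLLMEngine.from_engine_args(engine_args)
```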