Can't set up the serverless vLLM for the model.

Please help solve the problem. When trying to make a request, these entries are logged:

2024-04-24 18:25:10.089 [hrkxm58yz2r504] [info] INFO 04-24 15:25:10 model_runner.py:680] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing gpu_memory_utilization or enforcing eager mode. You can also reduce the max_num_seqs as needed to decrease memory usage.
2024-04-24 18:25:10.089 [hrkxm58yz2r504] [info] INFO 04-24 15:25:10 model_runner.py:676] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
2024-04-24 18:25:08.884 [hrkxm58yz2r504] [info] INFO 04-24 15:25:08 llm_engine.py:337] # GPU blocks: 1199, # CPU blocks: 327

Configuration:
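The log lines above are vLLM reporting CUDA graph capture and KV-cache allocation, and they name the engine options that control memory use (gpu_memory_utilization, enforce_eager, max_num_seqs). Below is a minimal sketch of those options using vLLM's Python API; on a serverless endpoint they would normally be set through the worker/template configuration rather than in code, and the values here are illustrative assumptions, not recommended settings.

```python
# Minimal sketch of the vLLM engine options named in the log above.
# Values are illustrative; tune them for your GPU.
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/MythoMax-L2-13B-GPTQ",  # model discussed in this thread
    quantization="gptq",                    # checkpoint is GPTQ-quantized
    gpu_memory_utilization=0.90,            # lower this if you run out of memory
    max_num_seqs=16,                        # fewer concurrent sequences -> smaller KV cache
    enforce_eager=True,                     # skip CUDA graph capture (saves the 1~3 GiB noted in the log)
)

out = llm.generate(["Hello"], SamplingParams(max_tokens=16))
print(out[0].outputs[0].text)
```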
17 Replies
Alpay Ariyak (13mo ago)
What size is your GPU?
haris (13mo ago)
cc: @Kostya ^^^
Kostya Popelnukh (OP, 13mo ago)
@Alpay Ariyak @haris 24 GB GPU
Jason (13mo ago)
I don't see an error there btw; those look like just info messages for the features being used there.
Kostya Popelnukh (OP, 12mo ago)
@nerdylive @haris Could you please tell me if this model (https://huggingface.co/TheBloke/MythoMax-L2-13B-GPTQ) is compatible?
Jason (12mo ago)
maybe it is
Alpay Ariyak (12mo ago)
It is compatible. Like @nerdylive said, there's no error message, just warnings.
Kostya Popelnukh (OP, 12mo ago)
This is very strange, because this model (https://huggingface.co/solidrust/Meta-Llama-3-8B-Instruct-hf-AWQ) works. What is the difference between them and how can I get MythoMax-L2-13B-GPTQ to work?
Jason (12mo ago)
What's making it not work? Any errors?
digigoblin (12mo ago)
AWQ and GPTQ are two different quantization methods, so you can't really compare an AWQ model with a GPTQ one. If you want a fair comparison, compare this GPTQ model against another GPTQ model rather than AWQ vs GPTQ. (See the sketch below.)
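For illustration, a minimal sketch showing that vLLM loads both formats through the same API, but with a quantization setting that must match the checkpoint; the model names are the two from this thread.

```python
# Sketch only: same API, different quantization formats.
from vllm import LLM

gptq_llm = LLM(model="TheBloke/MythoMax-L2-13B-GPTQ", quantization="gptq")
# awq_llm = LLM(model="solidrust/Meta-Llama-3-8B-Instruct-hf-AWQ", quantization="awq")
```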
Kostya Popelnukh (OP, 12mo ago)
@nerdylive There are no errors in the logs; only informational messages are being displayed.
Jason (12mo ago)
Press the running worker, then there will be a log button.
Kostya Popelnukh (OP, 12mo ago)
digigoblin (12mo ago)
Ran out of VRAM
Kostya Popelnukh (OP, 12mo ago)
Could you please tell me how to increase VRAM?
digigoblin (12mo ago)
Use the 48GB tier instead of the 24GB one.
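As a rough back-of-the-envelope sketch of where a 24GB card gets used up, assuming a Llama-2-13B-style architecture, fp16 KV cache, vLLM's default 16-token blocks, and the block count from the log above (all numbers approximate, not measurements):

```python
# Rough VRAM estimate for a 13B GPTQ model; every number is an assumption.
layers, hidden = 40, 5120               # Llama-2-13B-class architecture
params = 13e9

weights_gb = params * 0.5 / 1e9         # ~4-bit GPTQ weights: roughly 0.5 bytes/param
kv_per_token = 2 * layers * hidden * 2  # K + V per token, fp16 -> bytes
tokens = 1199 * 16                      # "# GPU blocks: 1199" from the log, 16 tokens/block
kv_cache_gb = kv_per_token * tokens / 1e9

print(f"weights ~{weights_gb:.1f} GB, KV cache ~{kv_cache_gb:.1f} GB")
# ~6.5 GB weights + ~15.7 GB KV cache, plus the 1~3 GiB the log says
# CUDA graphs can add, which leaves little headroom on a 24GB card.
```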
Kostya Popelnukh (OP, 12mo ago)
Thank you very much, it worked. I have another question. We use two fields in requests to the /openai/v1/chat/completions route: messages and prompt. According to the documentation https://github.com/runpod-workers/worker-vllm?tab=readme-ov-file#chat-completions and the API response, we cannot use these two fields simultaneously. Is it really not possible to use both fields in the same request, or am I doing something wrong?
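The two fields belong to two different OpenAI-style routes: chat completions take messages, while the (legacy) completions route takes prompt, so they are separate request schemas rather than two fields of one call. A sketch with the OpenAI Python client follows; the base URL format, endpoint ID, and key are placeholders and assumptions, not verified values.

```python
# Sketch: `messages` goes to the chat completions route,
# `prompt` goes to the separate completions route.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.runpod.ai/v2/<ENDPOINT_ID>/openai/v1",  # assumed RunPod base URL format
    api_key="<RUNPOD_API_KEY>",
)

chat = client.chat.completions.create(
    model="TheBloke/MythoMax-L2-13B-GPTQ",
    messages=[{"role": "user", "content": "Hello"}],
)

completion = client.completions.create(  # only if the worker exposes the completions route
    model="TheBloke/MythoMax-L2-13B-GPTQ",
    prompt="Hello",
)
```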
