Can't set up the serverless vLLM for the model.

Please help solve the problem. When trying to make a request, these errors are logged:

2024-04-24 18:25:10.089 [hrkxm58yz2r504] [info] INFO 04-24 15:25:10 model_runner.py:680] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing gpu_memory_utilization or enforcing eager mode. You can also reduce the max_num_seqs as needed to decrease memory usage.
2024-04-24 18:25:10.089 [hrkxm58yz2r504] [info] INFO 04-24 15:25:10 model_runner.py:676] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
2024-04-24 18:25:08.884 [hrkxm58yz2r504] [info] INFO 04-24 15:25:08 llm_engine.py:337] # GPU blocks: 1199, # CPU blocks: 327

Configuration:
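For context, the options mentioned in those log lines map to standard vLLM engine arguments. A minimal sketch, assuming the plain vLLM Python API rather than the actual serverless worker configuration:

```python
from vllm import LLM

# Sketch only: these knobs correspond to the hints in the log above.
llm = LLM(
    model="TheBloke/MythoMax-L2-13B-GPTQ",
    quantization="gptq",           # the checkpoint is GPTQ-quantized
    gpu_memory_utilization=0.90,   # lower this if the worker runs out of memory
    enforce_eager=True,            # skip CUDA graph capture (saves roughly 1~3 GiB per GPU)
    max_num_seqs=16,               # fewer concurrent sequences -> smaller KV cache
)
```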
Alpay Ariyak (3mo ago)
What size is your GPU?
haris (3mo ago)
cc: @Kostya ^^^
Kostya | Matrix One
@Alpay Ariyak @haris 24GB GPU
nerdylive (3mo ago)
I don't see an error there, BTW; those look like just info messages for the feature being used there.
Kostya | Matrix One
@nerdylive @haris Could you please tell me if this model (https://huggingface.co/TheBloke/MythoMax-L2-13B-GPTQ) is compatible?
nerdylive (3mo ago)
maybe it is
Alpay Ariyak (3mo ago)
It is compatible. Like @nerdylive said, there's no error message, just warnings.
Kostya | Matrix One
This is very strange, because this model (https://huggingface.co/solidrust/Meta-Llama-3-8B-Instruct-hf-AWQ) works. What is the difference between them and how can I get MythoMax-L2-13B-GPTQ to work?
nerdylive (3mo ago)
What's making it not work? Any errors?
digigoblin (3mo ago)
AWQ and GPTQ are two different quantization methods. You can't really compare an AWQ model with a GPTQ one; compare another GPTQ model against this GPTQ model rather than AWQ vs GPTQ.
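For illustration, a minimal sketch assuming the plain vLLM Python API: the quantization backend is chosen per checkpoint, so the two models are loaded with different backends even though the engine call looks the same.

```python
from vllm import LLM

# Load one or the other; each checkpoint uses its own quantization backend.
# GPTQ checkpoint:
llm = LLM(model="TheBloke/MythoMax-L2-13B-GPTQ", quantization="gptq")

# AWQ checkpoint:
# llm = LLM(model="solidrust/Meta-Llama-3-8B-Instruct-hf-AWQ", quantization="awq")
```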
Kostya | Matrix One
@nerdylive There are no errors in the logs; only informational messages are being displayed.
nerdylive (3mo ago)
Click the running worker; then there will be a log button.
digigoblin (3mo ago)
Ran out of VRAM
Kostya | Matrix One
Could you please tell me how to increase VRAM?
digigoblin (3mo ago)
Use the 48GB tier instead of the 24GB one.
Kostya | Matrix One
Thank you very much, it worked. I have another question. We use two fields in requests to the /openai/v1/chat/completions route: messages and prompt. According to the documentation (https://github.com/runpod-workers/worker-vllm?tab=readme-ov-file#chat-completions) and the API response, these two fields cannot be used simultaneously. Is it really impossible to use both fields in one request, or am I doing something wrong?
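A minimal sketch of the two request shapes, assuming the OpenAI Python client and placeholder credentials (the base_url shape follows RunPod's OpenAI-compatible serverless routes; adjust if yours differs): the chat completions route takes messages, while the plain completions route takes prompt.

```python
from openai import OpenAI

# Placeholder endpoint ID and API key.
client = OpenAI(
    base_url="https://api.runpod.ai/v2/<ENDPOINT_ID>/openai/v1",
    api_key="<RUNPOD_API_KEY>",
)

# Chat completions: send "messages" only.
chat = client.chat.completions.create(
    model="TheBloke/MythoMax-L2-13B-GPTQ",
    messages=[{"role": "user", "content": "Hello!"}],
)

# Text completions: send "prompt" only (different route: /openai/v1/completions).
text = client.completions.create(
    model="TheBloke/MythoMax-L2-13B-GPTQ",
    prompt="Hello!",
)

print(chat.choices[0].message.content)
print(text.choices[0].text)
```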