Can't set up the serverless vLLM for the model.

Please help solve the problem. When trying to make a request, these errors are logged:

2024-04-24 18:25:10.089 [hrkxm58yz2r504] [info] INFO 04-24 15:25:10 model_runner.py:680] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing gpu_memory_utilization or enforcing eager mode. You can also reduce the max_num_seqs as needed to decrease memory usage.
2024-04-24 18:25:10.089 [hrkxm58yz2r504] [info] INFO 04-24 15:25:10 model_runner.py:676] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
2024-04-24 18:25:08.884 [hrkxm58yz2r504] [info] INFO 04-24 15:25:08 llm_engine.py:337] # GPU blocks: 1199, # CPU blocks: 327

Configuration:
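For context, the options mentioned in those log lines map to standard vLLM engine arguments. A minimal sketch, assuming the plain vLLM Python API rather than the actual serverless worker configuration:

```python
from vllm import LLM

# Sketch only: these knobs correspond to the hints in the log above.
llm = LLM(
    model="TheBloke/MythoMax-L2-13B-GPTQ",
    quantization="gptq",           # the checkpoint is GPTQ-quantized
    gpu_memory_utilization=0.90,   # lower this if the worker runs out of memory
    enforce_eager=True,            # skip CUDA graph capture (saves roughly 1~3 GiB per GPU)
    max_num_seqs=16,               # fewer concurrent sequences -> smaller KV cache
)
```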
Alpay Ariyak (3mo ago)
What size is your GPU?
haris (3mo ago)
cc: @Kostya ^^^
Kostya | Matrix One
@Alpay Ariyak @haris 24GB GPU
nerdylive (3mo ago)
I don't see an error there, BTW; those look like just info messages for the feature being used there.
Kostya | Matrix One
@nerdylive @haris Could you please tell me if this model (https://huggingface.co/TheBloke/MythoMax-L2-13B-GPTQ) is compatible?
nerdylive (3mo ago)
maybe it is
Alpay Ariyak (3mo ago)
It is compatible. Like @nerdylive said, there's no error message, just warnings.
Kostya | Matrix One
This is very strange, because this model (https://huggingface.co/solidrust/Meta-Llama-3-8B-Instruct-hf-AWQ) works. What is the difference between them and how can I get MythoMax-L2-13B-GPTQ to work?
nerdylive (3mo ago)
What's making it not work? Any errors?
digigoblin (3mo ago)
AWQ and GPTQ are two different quantization methods. You can't really compare an AWQ model with a GPTQ one; compare another GPTQ model against this GPTQ model rather than AWQ vs GPTQ.
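For illustration, a minimal sketch assuming the plain vLLM Python API: the quantization backend is chosen per checkpoint, so the two models are loaded with different backends even though the engine call looks the same.

```python
from vllm import LLM

# Load one or the other; each checkpoint uses its own quantization backend.
# GPTQ checkpoint:
llm = LLM(model="TheBloke/MythoMax-L2-13B-GPTQ", quantization="gptq")

# AWQ checkpoint:
# llm = LLM(model="solidrust/Meta-Llama-3-8B-Instruct-hf-AWQ", quantization="awq")
```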
Kostya | Matrix One
@nerdylive There are no errors in the logs; only informational messages are being displayed.
nerdylive (3mo ago)
Click the running worker; then there will be a log button.
digigoblin (3mo ago)
Ran out of VRAM
Kostya | Matrix One
Could you please tell me how to increase VRAM?
digigoblin (3mo ago)
Use the 48GB tier instead of the 24GB one.
Kostya | Matrix One
Thank you very much, it worked. I have another question. We use two fields in requests to the /openai/v1/chat/completions route: messages and prompt. According to the documentation (https://github.com/runpod-workers/worker-vllm?tab=readme-ov-file#chat-completions) and the API response, these two fields cannot be used simultaneously. Is it really impossible to use both fields in one request, or am I doing something wrong?
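A minimal sketch of the two request shapes, assuming the OpenAI Python client and placeholder credentials (the base_url shape follows RunPod's OpenAI-compatible serverless routes; adjust if yours differs): the chat completions route takes messages, while the plain completions route takes prompt.

```python
from openai import OpenAI

# Placeholder endpoint ID and API key.
client = OpenAI(
    base_url="https://api.runpod.ai/v2/<ENDPOINT_ID>/openai/v1",
    api_key="<RUNPOD_API_KEY>",
)

# Chat completions: send "messages" only.
chat = client.chat.completions.create(
    model="TheBloke/MythoMax-L2-13B-GPTQ",
    messages=[{"role": "user", "content": "Hello!"}],
)

# Text completions: send "prompt" only (different route: /openai/v1/completions).
text = client.completions.create(
    model="TheBloke/MythoMax-L2-13B-GPTQ",
    prompt="Hello!",
)

print(chat.choices[0].message.content)
print(text.choices[0].text)
```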