Incredibly long startup time when running 70B models via vLLM
I have been trying to deploy 70B models as a serverless endpoint and I observe startup times of almost an hour, if the endpoint becomes available at all. The attached screenshot shows an example of an endpoint that deploys cognitivecomputations/dolphin-2.9.1-llama-3-70b. I find it even weirder that the request ultimately succeeds. Logs and a screenshot of the endpoint and template config are attached - if anyone can spot an issue or knows how to deploy 70B models such that they reliably work, I would greatly appreciate it.

Some other observations:
- In support, someone told me that I need to manually set the env var BASE_PATH=/workspace, which I am now always doing.
- I sometimes, but not always, see this in the logs: AsyncEngineArgs(model='facebook/opt-125m', served_model_name=None, tokenizer='facebook/opt-125m'..., even though I am deploying a completely different model. Since facebook/opt-125m is vLLM's built-in default model, this looks like the model name I configured is sometimes not being picked up (see the first sketch below for how I check this).
- I sometimes, but not always, get issues when I don't specify the chat template (see the second sketch below).
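
To check which model the worker actually loaded, I query the endpoint's OpenAI-compatible route. A minimal sketch, assuming the RunPod vLLM worker's documented URL pattern; ENDPOINT_ID and the API key env var are placeholders for your own values:

```python
# Minimal sketch: list the models the endpoint is actually serving.
# Assumes the RunPod vLLM worker's OpenAI-compatible route; <ENDPOINT_ID>
# is a placeholder for the serverless endpoint ID.
import os

from openai import OpenAI

client = OpenAI(
    base_url="https://api.runpod.ai/v2/<ENDPOINT_ID>/openai/v1",
    api_key=os.environ["RUNPOD_API_KEY"],
)

# If this prints facebook/opt-125m instead of the 70B model, the worker fell
# back to vLLM's built-in default and ignored the configured model name.
for model in client.models.list():
    print(model.id)
```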
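On the chat template point, a quick local check of whether the model ships its own template (this only downloads the tokenizer files, not the 70B weights; if this raises, the worker presumably needs an explicit template):

```python
# Minimal sketch: check whether the model's tokenizer config bundles a chat
# template. Only tokenizer files are downloaded, not the model weights.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("cognitivecomputations/dolphin-2.9.1-llama-3-70b")

# Renders one user turn through the bundled template; raises an error if the
# tokenizer config does not define a chat template.
print(tok.apply_chat_template([{"role": "user", "content": "hi"}], tokenize=False))
```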


