Incredibly long startup time when running 70B models via vLLM
I have been trying to deploy 70B models as a serverless endpoint and I observe startup times of almost an hour, if the endpoint becomes available at all. The attached screenshot shows an example of an endpoint that deploys cognitivecomputations/dolphin-2.9.1-llama-3-70b. I find it even weirder that the request ultimately succeeds. Logs and a screenshot of the endpoint and template config are attached - if anyone can spot an issue or knows how to deploy 70B models such that they reliably work, I would greatly appreciate it.

Some other observations:
- In support, someone told me that I need to manually set the env var BASE_PATH=/workspace, which I am now always doing.
- I sometimes, but not always, see this in the logs: AsyncEngineArgs(model='facebook/opt-125m', served_model_name=None, tokenizer='facebook/opt-125m'..., even though I am deploying a completely different model. Since facebook/opt-125m is vLLM's built-in default model, this looks like the model name I configured is sometimes not being picked up (see the first sketch below for how I check this).
- I sometimes, but not always, get issues when I don't specify the chat template (see the second sketch below).
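
To check which model the worker actually loaded, I query the endpoint's OpenAI-compatible route. A minimal sketch, assuming the RunPod vLLM worker's documented URL pattern; ENDPOINT_ID and the API key env var are placeholders for your own values:

```python
# Minimal sketch: list the models the endpoint is actually serving.
# Assumes the RunPod vLLM worker's OpenAI-compatible route; <ENDPOINT_ID>
# is a placeholder for the serverless endpoint ID.
import os

from openai import OpenAI

client = OpenAI(
    base_url="https://api.runpod.ai/v2/<ENDPOINT_ID>/openai/v1",
    api_key=os.environ["RUNPOD_API_KEY"],
)

# If this prints facebook/opt-125m instead of the 70B model, the worker fell
# back to vLLM's built-in default and ignored the configured model name.
for model in client.models.list():
    print(model.id)
```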
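On the chat template point, a quick local check of whether the model ships its own template (this only downloads the tokenizer files, not the 70B weights; if this raises, the worker presumably needs an explicit template):

```python
# Minimal sketch: check whether the model's tokenizer config bundles a chat
# template. Only tokenizer files are downloaded, not the model weights.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("cognitivecomputations/dolphin-2.9.1-llama-3-70b")

# Renders one user turn through the bundled template; raises an error if the
# tokenizer config does not define a chat template.
print(tok.apply_chat_template([{"role": "user", "content": "hi"}], tokenize=False))
```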


