Without MAX_MODEL_LEN=15000, I got:

"The model's max seq len (131072) is larger than the maximum number of tokens that can be stored in KV cache (18368). Try increasing gpu_memory_utilization or decreasing max_model_len when initializing the engine."
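For reference, a minimal sketch of how the two knobs from the error message map onto the worker's environment variables (variable names assumed to follow the runpod/worker-vllm convention of upper-casing the vLLM engine arguments, as with MAX_MODEL_LEN above):

```shell
# Cap the context window so the KV cache fits on this GPU.
MAX_MODEL_LEN=15000

# Alternatively, give vLLM a larger share of GPU memory
# (vLLM's default is 0.90; name assumed by analogy with MAX_MODEL_LEN).
GPU_MEMORY_UTILIZATION=0.95
```

Either change resolves the mismatch: the first shrinks the required KV cache, the second grows the memory available to hold it.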
2024-08-08T12:44:26Z 4f4fb700ef54 Extracting [==================================================>] 32B/32B
2024-08-08T12:44:26Z 4f4fb700ef54 Extracting [==================================================>] 32B/32B
2024-08-08T12:44:26Z 4f4fb700ef54 Pull complete
2024-08-08T12:44:26Z Digest: sha256:44f3a3d209d0df623295065203da969e69f57fe0b8b73520e9bef47fb9d33c70
2024-08-08T12:44:26Z Status: Downloaded newer image for runpod/worker-v1-vllm:stable-cuda12.1.0
2024-08-08T12:44:26Z worker is ready
2024-08-08T12:44:38Z create pod network
2024-08-08T12:44:38Z create container runpod/worker-v1-vllm:stable-cuda12.1.0
2024-08-08T12:44:38Z stable-cuda12.1.0 Pulling from runpod/worker-v1-vllm
2024-08-08T12:44:38Z Digest: sha256:44f3a3d209d0df623295065203da969e69f57fe0b8b73520e9bef47fb9d33c70
2024-08-08T12:44:38Z Status: Image is up to date for runpod/worker-v1-vllm:stable-cuda12.1.0
2024-08-08T12:44:38Z worker is ready
2024-08-08T12:44:39Z start container
2024-08-08T12:48:14Z start container