Without MAX_MODEL_LEN=15000, I got:

"The model's max seq len (131072) is larger than the maximum number of tokens that can be stored in KV cache (18368). Try increasing gpu_memory_utilization or decreasing max_model_len when initializing the engine."
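For reference, a minimal sketch of how the two knobs from the error message map onto the worker's environment variables (variable names assumed to follow the runpod/worker-vllm convention of upper-casing the vLLM engine arguments, as with MAX_MODEL_LEN above):

```shell
# Cap the context window so the KV cache fits on this GPU.
MAX_MODEL_LEN=15000

# Alternatively, give vLLM a larger share of GPU memory
# (vLLM's default is 0.90; name assumed by analogy with MAX_MODEL_LEN).
GPU_MEMORY_UTILIZATION=0.95
```

Either change resolves the mismatch: the first shrinks the required KV cache, the second grows the memory available to hold it.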
2024-08-08T12:44:26Z 4f4fb700ef54 Extracting [==================================================>] 32B/32B
2024-08-08T12:44:26Z 4f4fb700ef54 Extracting [==================================================>] 32B/32B
2024-08-08T12:44:26Z 4f4fb700ef54 Pull complete
2024-08-08T12:44:26Z Digest: sha256:44f3a3d209d0df623295065203da969e69f57fe0b8b73520e9bef47fb9d33c70
2024-08-08T12:44:26Z Status: Downloaded newer image for runpod/worker-v1-vllm:stable-cuda12.1.0
2024-08-08T12:44:26Z worker is ready
2024-08-08T12:44:38Z create pod network
2024-08-08T12:44:38Z create container runpod/worker-v1-vllm:stable-cuda12.1.0
2024-08-08T12:44:38Z stable-cuda12.1.0 Pulling from runpod/worker-v1-vllm
2024-08-08T12:44:38Z Digest: sha256:44f3a3d209d0df623295065203da969e69f57fe0b8b73520e9bef47fb9d33c70
2024-08-08T12:44:38Z Status: Image is up to date for runpod/worker-v1-vllm:stable-cuda12.1.0
2024-08-08T12:44:38Z worker is ready
2024-08-08T12:44:39Z start container
2024-08-08T12:48:14Z start container