WARNING 02-26 18:08:57 config.py:186] awq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
llm_engine.py:79] Initializing an LLM engine with config: model='casperhansen/mixtral-instruct-awq', tokenizer='casperhansen/mixtral-instruct-awq', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=32768, download_dir='/runpod-volume/huggingface-cache/hub', load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=awq, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, seed=0)
Using model weights format ['*.safetensors']
llm_engine.py:337] # GPU blocks: 7488, # CPU blocks: 2048
Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
Error initializing vLLM engine: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
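
The log fails with an out-of-memory error during CUDA graph capture, and the warnings above already name the knobs to turn: enforce eager mode, lower `gpu_memory_utilization`, or reduce `max_num_seqs`. Below is a minimal sketch, assuming the Python `LLM` API is used rather than the CLI, of re-initializing the engine with those mitigations applied. The specific values chosen for `gpu_memory_utilization` and `max_num_seqs` are illustrative assumptions, not values taken from the log.

```python
# Sketch of re-initializing the vLLM engine with the OOM mitigations the
# warnings suggest. Numeric values are assumptions; tune them for your GPU.
from vllm import LLM

llm = LLM(
    model="casperhansen/mixtral-instruct-awq",
    quantization="awq",
    dtype="float16",
    trust_remote_code=True,
    max_model_len=32768,
    download_dir="/runpod-volume/huggingface-cache/hub",
    enforce_eager=True,           # skip CUDA graph capture (saves roughly 1-3 GiB per GPU)
    gpu_memory_utilization=0.85,  # assumed value; lower it further if OOM persists
    max_num_seqs=64,              # assumed value; fewer concurrent sequences reduce memory pressure
)
```

When launching through the CLI instead, `--enforce-eager` (as mentioned in the log) achieves the same effect as `enforce_eager=True`, and the other two settings are typically exposed as the equivalent `--gpu-memory-utilization` and `--max-num-seqs` flags.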