WARNING 02-26 18:08:57 config.py:186] awq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
llm_engine.py:79] Initializing an LLM engine with config: model='casperhansen/mixtral-instruct-awq', tokenizer='casperhansen/mixtral-instruct-awq', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=32768, download_dir='/runpod-volume/huggingface-cache/hub', load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=awq, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, seed=0)
Using model weights format ['*.safetensors']
llm_engine.py:337] # GPU blocks: 7488, # CPU blocks: 2048
Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
Error initializing vLLM engine: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
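
The log fails with an out-of-memory error during CUDA graph capture, and the warnings above already name the knobs to turn: enforce eager mode, lower `gpu_memory_utilization`, or reduce `max_num_seqs`. Below is a minimal sketch, assuming the Python `LLM` API is used rather than the CLI, of re-initializing the engine with those mitigations applied. The specific values chosen for `gpu_memory_utilization` and `max_num_seqs` are illustrative assumptions, not values taken from the log.

```python
# Sketch of re-initializing the vLLM engine with the OOM mitigations the
# warnings suggest. Numeric values are assumptions; tune them for your GPU.
from vllm import LLM

llm = LLM(
    model="casperhansen/mixtral-instruct-awq",
    quantization="awq",
    dtype="float16",
    trust_remote_code=True,
    max_model_len=32768,
    download_dir="/runpod-volume/huggingface-cache/hub",
    enforce_eager=True,           # skip CUDA graph capture (saves roughly 1-3 GiB per GPU)
    gpu_memory_utilization=0.85,  # assumed value; lower it further if OOM persists
    max_num_seqs=64,              # assumed value; fewer concurrent sequences reduce memory pressure
)
```

When launching through the CLI instead, `--enforce-eager` (as mentioned in the log) achieves the same effect as `enforce_eager=True`, and the other two settings are typically exposed as the equivalent `--gpu-memory-utilization` and `--max-num-seqs` flags.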