Selecting an HF quant
Hi, I'm using vLLM serverless. Is there a way to specify which quant to use for an HF GGUF model directory URL?
3 Replies
vLLM loads only single-file GGUF models, so you need to download the desired quantization variant yourself before serving it with vLLM. See https://github.com/vllm-project/vllm/issues/8570
Keep in mind that GGUF is optimized for CPU inference more than GPU. It's also recommended to use the original model's tokenizer.
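A minimal sketch of that workflow, assuming you want one quant out of a multi-quant repo (the repo, file, and tokenizer names below are illustrative): download the single GGUF file with `hf_hub_download`, then point `vllm serve` at it with the original model's tokenizer.

```python
# Sketch: pull one quantization variant out of a multi-quant GGUF repo,
# then serve that single file with vLLM. Names here are illustrative.
from huggingface_hub import hf_hub_download

gguf_path = hf_hub_download(
    repo_id="TheBloke/Llama-2-7B-GGUF",  # repo holding many quant variants
    filename="llama-2-7b.Q4_K_M.gguf",   # the specific quant you want
)
print(gguf_path)  # local path to the downloaded single-file GGUF

# Then serve that file, passing the original model's tokenizer
# as suggested above, e.g.:
#   vllm serve /path/to/llama-2-7b.Q4_K_M.gguf \
#       --tokenizer meta-llama/Llama-2-7b-hf
```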
Thanks! Is there a model format other than GGUF you'd recommend to take better advantage of vLLM serving?
GPTQ or AWQ. vLLM has custom Marlin and Machete kernels for those formats. You can read about inference and conversion here: https://docs.vllm.ai/en/stable/features/quantization/index.html
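As a rough sketch, loading an AWQ checkpoint looks like this with vLLM's Python API (the model name is illustrative; any AWQ- or GPTQ-quantized HF repo works the same way):

```python
# Sketch: running an AWQ-quantized model through vLLM's Python API.
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/Llama-2-7B-AWQ",  # illustrative AWQ repo
    quantization="awq",               # vLLM uses its Marlin kernels where supported
)
params = SamplingParams(temperature=0.7, max_tokens=64)
out = llm.generate(["Hello, my name is"], params)
print(out[0].outputs[0].text)
```

The CLI equivalent would be along the lines of `vllm serve TheBloke/Llama-2-7B-AWQ --quantization awq`.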