Runpod · 6mo ago
Bj9000

Selecting an HF quant

Hi, I'm using vLLM serverless. Is there a way to specify which quant to use for an HF GGUF model repo URL?
3 Replies
3WaD · 6mo ago
vLLM loads only single-file GGUF models, so you need to download the desired quantization variant yourself before serving it with vLLM. See https://github.com/vllm-project/vllm/issues/8570 Keep in mind that GGUF is optimized more for CPU inference than for GPU. It's also recommended to use the original model's tokenizer.
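For reference, here's a minimal sketch of that workflow (untested; the repo, file, and tokenizer names are placeholders you'd swap for your own model):
```python
# Minimal sketch: download one GGUF quant file, then serve it with vLLM.
# Repo, filename, and tokenizer below are placeholders - substitute your own model.
from huggingface_hub import hf_hub_download
from vllm import LLM

# Pull only the quantization variant you want, not the whole multi-quant repo
gguf_path = hf_hub_download(
    repo_id="TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF",
    filename="tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf",  # the specific quant file
)

# Point vLLM at the local file and pass the ORIGINAL model's tokenizer,
# as mentioned above
llm = LLM(
    model=gguf_path,
    tokenizer="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
)
print(llm.generate("Hello!")[0].outputs[0].text)
```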
Bj9000 (OP) · 6mo ago
Thanks! Is there a better model format you'd recommend, other than GGUF, to take advantage of vLLM serving?
3WaD · 6mo ago
GPTQ or AWQ. vLLM has custom Marlin and Machete kernels for those formats. You can read about inference and conversion here: https://docs.vllm.ai/en/stable/features/quantization/index.html
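As a quick illustration, a minimal sketch of serving an AWQ checkpoint with vLLM (the model name is a placeholder; any AWQ repo on the Hub works the same way):
```python
# Minimal sketch: serving an AWQ-quantized checkpoint with vLLM.
# The model name is a placeholder - swap in the AWQ repo you want to serve.
from vllm import LLM

llm = LLM(
    model="TheBloke/Mistral-7B-Instruct-v0.2-AWQ",
    quantization="awq",  # usually auto-detected from the repo's quantization_config
)
print(llm.generate("Hello!")[0].outputs[0].text)
```
On supported GPUs, vLLM transparently routes AWQ models through its Marlin kernel, which is where the speedup over GGUF comes from.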
