Selecting an HF quant
Hi, I'm using vLLM serverless. Is there a way to specify which quant to use for an HF GGUF model directory URL?
3 Replies
vLLM loads only single-file GGUF models, so you need to download the desired quantization variant yourself before serving it with vLLM. See https://github.com/vllm-project/vllm/issues/8570
Keep in mind that GGUF is optimized for CPU inference more than GPU. It's also recommended to use the original model's tokenizer.
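A minimal sketch of that workflow, assuming you want one quant out of a multi-quant repo (the repo, file, and tokenizer names below are illustrative): download the single GGUF file with `hf_hub_download`, then point `vllm serve` at it with the original model's tokenizer.

```python
# Sketch: pull one quantization variant out of a multi-quant GGUF repo,
# then serve that single file with vLLM. Names here are illustrative.
from huggingface_hub import hf_hub_download

gguf_path = hf_hub_download(
    repo_id="TheBloke/Llama-2-7B-GGUF",  # repo holding many quant variants
    filename="llama-2-7b.Q4_K_M.gguf",   # the specific quant you want
)
print(gguf_path)  # local path to the downloaded single-file GGUF

# Then serve that file, passing the original model's tokenizer
# as suggested above, e.g.:
#   vllm serve /path/to/llama-2-7b.Q4_K_M.gguf \
#       --tokenizer meta-llama/Llama-2-7b-hf
```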
Thanks! Is there a model format other than GGUF you'd recommend to take better advantage of vLLM serving?
GPTQ or AWQ. vLLM has custom Marlin and Machete kernels for those formats. You can read about inference and conversion here: https://docs.vllm.ai/en/stable/features/quantization/index.html
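As a rough sketch, loading an AWQ checkpoint looks like this with vLLM's Python API (the model name is illustrative; any AWQ- or GPTQ-quantized HF repo works the same way):

```python
# Sketch: running an AWQ-quantized model through vLLM's Python API.
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/Llama-2-7B-AWQ",  # illustrative AWQ repo
    quantization="awq",               # vLLM uses its Marlin kernels where supported
)
params = SamplingParams(temperature=0.7, max_tokens=64)
out = llm.generate(["Hello, my name is"], params)
print(out[0].outputs[0].text)
```

The CLI equivalent would be along the lines of `vllm serve TheBloke/Llama-2-7B-AWQ --quantization awq`.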