Runpod•15mo ago

Runpod VLLM - How to use GGUF with VLLM

I have this repo mradermacher/Llama-3.1-8B-Stheno-v3.4-i1-GGUF and I use this command
"--host 0.0.0.0 --port 8000 --max-model-len 37472 --model mradermacher/Llama-3.1-8B-Stheno-v3.4-i1-GGUF --dtype bfloat16 --gpu-memory-utilization 0.95 --quantization gguf" but it doesn't work...

It say "2024-10-07T20:39:24.964316283Z ValueError: No supported config format found in mradermacher/Llama-3.1-8B-Stheno-v3.4-i1-GGUF"

I don't have this problem with normal models, only with quantized one...

Jason•10/7/24, 11:17 PM

Meaning it doesn't support gguf yet

Jason•10/7/24, 11:17 PM

The vllm

Sal ✨OP•10/8/24, 1:05 PM

Can I upgrade my vLLM version on my template. I use the runpod vvlm/openai

Jason•10/8/24, 2:42 PM

Hmm your template? Sure but Runpod's image is different I think you must create a pr to the github repo

Jason•10/8/24, 2:42 PM

Or at least create an issue to get a notice of that feature

Jason•10/8/24, 2:43 PM

If you got the docs (vllm) which doesn't exist right now for it, attach the links too

SSal ✨I have this repo mradermacher/Llama-3.1-8B-Stheno-v3.4-i1-GGUF and I use this co...

Encyrption•10/8/24, 2:43 PM

Have you tried converting the model so that RunPod can use it? Here is example Python for converting it:

import torch
# Load the PyTorch model
model = torch.load("your_model.pth")

# Dummy input for the model export
dummy_input = torch.randn(1, 3, 224, 224)

# Export the model to ONNX
torch.onnx.export(model, dummy_input, "model.onnx", verbose=True)

import torch
# Load the PyTorch model
model = torch.load("your_model.pth")

# Dummy input for the model export
dummy_input = torch.randn(1, 3, 224, 224)

# Export the model to ONNX
torch.onnx.export(model, dummy_input, "model.onnx", verbose=True)

Jason•10/8/24, 2:44 PM

Wow model convert, so that means the onnx is still quantized?

JJason Wow model convert, so that means the onnx is still quantized?

Encyrption•10/8/24, 2:49 PM

I don't know such things... I would think so.

Runpod VLLM - How to use GGUF with VLLM

Similar Threads

Similar Threads

Similar Threads