Deploying bitsandbytes-quantized Models on RunPod Serverless using Custom Docker Image

J

Jason•11/1/24, 4:07 PM

im not sure if theres a way but if theres a way maybe you can unquantize it somehow? or convert it to another format that it supports

AArad Hey everyone 👋 Looking for tips from anyone who's worked with bitsandbytes-quan...

M

Mohamed Nagy•1/20/25, 6:54 PM

Any updates!, I want to do the same thing with 3.3 version.

J

Jason•1/21/25, 1:04 AM

you can make custom workers actually

J

Jason•1/21/25, 1:04 AM

with some code, library, frameworks that works inside linux that can run bitsandbytes models

J

Jason•1/21/25, 1:07 AM

https://docs.vllm.ai/en/stable/quantization/bnb.html
Oh actually
im not sure if theres an option for this for runpod vllm-worker

J

Jason•1/21/25, 1:16 AM

maybe something like this would work

J

Jason•1/21/25, 1:17 AM

whats the model's file extension for bnb?

.safetensors

.safetensors

.safetensors

.safetensors?

JJason maybe something like this would work

M

Mohamed Nagy•1/21/25, 9:32 AM

I tried this but, the vllm-worker checks the variable if its not one of the defined chocies

M

Mohamed Nagy•1/21/25, 9:33 AM

I fork the vllm-wrorker and change it to accept bitsandbytes

JJason whats the model's **file extension** for bnb? `.safetensors`?

M

Mohamed Nagy•1/21/25, 9:34 AM

yes

MMohamed Nagy I tried this but, the vllm-worker checks the variable if its not one of the defi...

J

Jason•1/21/25, 11:00 AM

Oh the file format?

M

Mohamed Nagy•1/21/25, 11:04 AM

in this

worker-config.json

worker-config.json

worker-config.json

worker-config.json

  "QUANTIZATION": {
    "env_var_name": "QUANTIZATION",
    "value": "",
    "title": "Quantization",
    "description": "Method used to quantize the weights.",
    "required": false,
    "type": "select",
    "options": [
      { "value": "None", "label": "None" },
      { "value": "awq", "label": "AWQ" },
      { "value": "squeezellm", "label": "SqueezeLLM" },
      { "value": "gptq", "label": "GPTQ" }
    ]
  },

  "QUANTIZATION": {
    "env_var_name": "QUANTIZATION",
    "value": "",
    "title": "Quantization",
    "description": "Method used to quantize the weights.",
    "required": false,
    "type": "select",
    "options": [
      { "value": "None", "label": "None" },
      { "value": "awq", "label": "AWQ" },
      { "value": "squeezellm", "label": "SqueezeLLM" },
      { "value": "gptq", "label": "GPTQ" }
    ]
  },

  "QUANTIZATION": {
    "env_var_name": "QUANTIZATION",
    "value": "",
    "title": "Quantization",
    "description": "Method used to quantize the weights.",
    "required": false,
    "type": "select",
    "options": [
      { "value": "None", "label": "None" },
      { "value": "awq", "label": "AWQ" },
      { "value": "squeezellm", "label": "SqueezeLLM" },
      { "value": "gptq", "label": "GPTQ" }
    ]
  },

  "QUANTIZATION": {
    "env_var_name": "QUANTIZATION",
    "value": "",
    "title": "Quantization",
    "description": "Method used to quantize the weights.",
    "required": false,
    "type": "select",
    "options": [
      { "value": "None", "label": "None" },
      { "value": "awq", "label": "AWQ" },
      { "value": "squeezellm", "label": "SqueezeLLM" },
      { "value": "gptq", "label": "GPTQ" }
    ]
  },

does not has

bitsandbytes

bitsandbytes

bitsandbytes

bitsandbytes

J

Jason•1/21/25, 11:05 AM

Ohhh

J

Jason•1/21/25, 11:05 AM

In the website it doesn't accept other than those options right?

J

Jason•1/21/25, 11:06 AM

Thanks for sharing this, it'll be helpful for others that wanna use bnb in the future

JJason https://docs.vllm.ai/en/stable/quantization/bnb.html Oh actually im not sure if ...

M

Mohamed Nagy•1/21/25, 11:07 AM

this may work, I am going to test runpod-vllm-worker with

LOAD_FORMAT

LOAD_FORMAT

LOAD_FORMAT

LOAD_FORMAT it supports

bitsandbytes

bitsandbytes

bitsandbytes

bitsandbytes
hope the src/engine will load it, I think it will not because in the github repo they does not handle it fully like in this https://docs.vllm.ai/en/stable/quantization/bnb.html

JJason In the website it doesn't accept other than those options right?

M

Mohamed Nagy•1/21/25, 11:07 AM

yes

M

Mohamed Nagy•1/21/25, 11:08 AM

I will inform you

MMohamed Nagy this may work, I am going to test runpod-vllm-worker with `LOAD_FORMAT` it suppo...

J

Jason•1/21/25, 11:11 AM

whats missing? i thought they pass on that arguments? (not sure about the load_format one) yeah..

M

Mohamed Nagy•1/21/25, 11:19 AM

I got the expected error

J

Jason•1/21/25, 11:20 AM

what error?

M

Mohamed Nagy•1/21/25, 11:34 AM

the param

Load_Fromat

Load_Fromat

Load_Fromat

Load_Fromat support accept "BitsAndBytes"
and if it set to "BitsAndBytes" then

QUANTIZATION

QUANTIZATION

QUANTIZATION

QUANTIZATION must be "bitsandbytes" ("None" will not work)

the

QUANTIZATION

QUANTIZATION

QUANTIZATION

QUANTIZATION options are "None", "AWQ", "SqueezeLLM", "GPTQ"

the error Here is the error: itsAndBytes load format and QLoRA adapter only support 'bitsandbytes' quantization

engine.py           :115  2025-01-21 11:18:49,916 Error initializing vLLM engine: BitsAndBytes load format and QLoRA adapter only support 'bitsandbytes' quantization, but got None

engine.py           :115  2025-01-21 11:18:49,916 Error initializing vLLM engine: BitsAndBytes load format and QLoRA adapter only support 'bitsandbytes' quantization, but got None

engine.py           :115  2025-01-21 11:18:49,916 Error initializing vLLM engine: BitsAndBytes load format and QLoRA adapter only support 'bitsandbytes' quantization, but got None

engine.py           :115  2025-01-21 11:18:49,916 Error initializing vLLM engine: BitsAndBytes load format and QLoRA adapter only support 'bitsandbytes' quantization, but got None

https://github.com/runpod-workers/worker-vllm/issues/99

GitHub

Bitsandbytes support · Issue #99 · runpod-workers/worker-vllm

Hi there! vllm supports bitsandbytes quantization, but there is no bitsandbytes dependency in requirements.txt. Is there any plans to fix that?

M

Mohamed Nagy•1/21/25, 11:43 AM

https://github.com/runpod-workers/worker-vllm/issues/145

GitHub

`worker-config.json`'s QUANTIZATION does not has 'bitsandbytes' opt...

Here is the error: itsAndBytes load format and QLoRA adapter only support 'bitsandbytes' quantization engine.py :26 2025-01-21 11:18:49,619 Engine args: AsyncEngineArgs(model='unsloth/t...

J

Jason•1/21/25, 12:21 PM

oh have you got it to work now?

M

Mohamed Nagy•1/22/25, 7:31 AM

yeah, [https://github.com/mohamednaji7/worker-vllm/tree/main] I add few line to complete the option of using bitsandbytes

and this is a my merge request [https://github.com/runpod-workers/worker-vllm/pull/146]

Deploying bitsandbytes-quantized Models on RunPod Serverless using Custom Docker Image

Similar Threads

Similar Threads

Similar Threads