Runpod•13mo ago
Arad

Deploying bitsandbytes-quantized Models on RunPod Serverless using Custom Docker Image

Hey everyone 👋 Looking for tips from anyone who's worked with bitsandbytes-quantized models on RunPod's serverless setup. It's not available out of the box with vLLM, and I was wondering if anyone's got it working? Saw a post in the serverless forum about maybe using a custom Docker image for this. For context: I've fine-tuned LLaMA-3.1 70B-instruct using the unsloth library (which utilizes bitsandbytes for quantization) and am looking to deploy it. Any insights would be greatly appreciated! 🙏
15 Replies
Mohamed Nagy•10mo ago
Any updates? I want to do the same thing with the 3.3 version.
Mohamed Nagy•10mo ago
I tried this, but the vllm-worker checks the variable against its defined choices. If the value is not one of them, I'll fork the vllm-worker and change it to accept bitsandbytes. Yes.
Mohamed Nagy•10mo ago
In this worker-config.json:
"QUANTIZATION": {
  "env_var_name": "QUANTIZATION",
  "value": "",
  "title": "Quantization",
  "description": "Method used to quantize the weights.",
  "required": false,
  "type": "select",
  "options": [
    { "value": "None", "label": "None" },
    { "value": "awq", "label": "AWQ" },
    { "value": "squeezellm", "label": "SqueezeLLM" },
    { "value": "gptq", "label": "GPTQ" }
  ]
},
It does not have bitsandbytes.
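A minimal sketch of the kind of change this implies, assuming the fix is simply to append a bitsandbytes entry to the QUANTIZATION options. The helper name add_select_option and the label text are hypothetical, and the real worker-config.json may nest this key differently, so the sketch works on the quoted fragment only:

```python
import copy


def add_select_option(setting: dict, value: str, label: str) -> dict:
    """Return a copy of a "select" setting with one more option appended.

    Hypothetical helper: it only assumes the shape of the QUANTIZATION
    fragment quoted in the thread (an "options" list of value/label dicts).
    """
    updated = copy.deepcopy(setting)
    options = updated.setdefault("options", [])
    # Skip the append if the value is already offered, so reruns are harmless.
    if not any(opt.get("value") == value for opt in options):
        options.append({"value": value, "label": label})
    return updated


# The QUANTIZATION fragment as quoted in the thread.
quantization = {
    "env_var_name": "QUANTIZATION",
    "value": "",
    "title": "Quantization",
    "description": "Method used to quantize the weights.",
    "required": False,
    "type": "select",
    "options": [
        {"value": "None", "label": "None"},
        {"value": "awq", "label": "AWQ"},
        {"value": "squeezellm", "label": "SqueezeLLM"},
        {"value": "gptq", "label": "GPTQ"},
    ],
}

patched = add_select_option(quantization, "bitsandbytes", "bitsandbytes")
print([opt["value"] for opt in patched["options"]])
```

The original dict is left untouched, which makes it easy to diff the patched fragment against the shipped one before writing anything back to disk.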
Mohamed Nagy•10mo ago
This may work. I am going to test runpod-vllm-worker with LOAD_FORMAT, which supports bitsandbytes. I hope src/engine will load it, but I think it will not, because the GitHub repo does not handle it fully, as described in https://docs.vllm.ai/en/stable/quantization/bnb.html. Yes, I will inform you.
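A rough sketch of what this test amounts to, assuming (as the linked vLLM bitsandbytes page described at the time) that both the quantization and load_format engine arguments are set to "bitsandbytes". The function names are hypothetical, and the lowercasing of "BitsAndBytes" and the "auto" fallback are assumptions made for illustration, not the worker's actual src/engine logic:

```python
import os
from typing import Mapping, Optional


def _none_if_blank(raw: Optional[str]) -> Optional[str]:
    """Treat a missing, empty, or literal "None" setting as Python None."""
    if raw is None:
        return None
    cleaned = raw.strip()
    if cleaned == "" or cleaned.lower() == "none":
        return None
    # Assumption: values are normalized to lowercase before use.
    return cleaned.lower()


def bnb_engine_kwargs(env: Mapping[str, str] = os.environ) -> dict:
    """Build the two engine arguments discussed in the thread from env vars."""
    return {
        "quantization": _none_if_blank(env.get("QUANTIZATION")),
        # Assumption: fall back to "auto" when LOAD_FORMAT is not set.
        "load_format": _none_if_blank(env.get("LOAD_FORMAT")) or "auto",
    }


# Both variables set, as the bitsandbytes path appears to require.
print(bnb_engine_kwargs({"QUANTIZATION": "bitsandbytes", "LOAD_FORMAT": "BitsAndBytes"}))
# Only LOAD_FORMAT set: quantization stays None.
print(bnb_engine_kwargs({"QUANTIZATION": "None", "LOAD_FORMAT": "BitsAndBytes"}))
```

The resulting dict is meant to be passed on as keyword arguments to the engine configuration; whether the stock worker forwards both values unchanged is exactly what is being tested here.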
Mohamed Nagy•10mo ago
I got the expected error
Mohamed Nagy•10mo ago
The LOAD_FORMAT param accepts "BitsAndBytes", and if it is set to "BitsAndBytes" then QUANTIZATION must be "bitsandbytes" ("None" will not work). However, the only QUANTIZATION options are "None", "AWQ", "SqueezeLLM", and "GPTQ".
Here is the error:
BitsAndBytes load format and QLoRA adapter only support 'bitsandbytes' quantization engine.py :115 2025-01-21 11:18:49,916 Error initializing vLLM engine: BitsAndBytes load format and QLoRA adapter only support 'bitsandbytes' quantization, but got None
https://github.com/runpod-workers/worker-vllm/issues/99
GitHub
Bitsandbytes support · Issue #99 · runpod-workers/worker-vllm
Hi there! vllm supports bitsandbytes quantization, but there is no bitsandbytes dependency in requirements.txt. Is there any plans to fix that?
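The constraint behind that error can be illustrated with a small stand-in check. This is a sketch that reproduces the reported message under the assumption that the rule is "a bitsandbytes load format requires bitsandbytes quantization". It is not vLLM's actual source, and check_bnb_consistency is a hypothetical name:

```python
from typing import Optional


def check_bnb_consistency(load_format: str, quantization: Optional[str]) -> None:
    """Raise if a bitsandbytes load format is paired with another quantization.

    Stand-in for the engine-side validation, written only to show why
    QUANTIZATION="None" fails when LOAD_FORMAT is "BitsAndBytes".
    """
    if load_format.lower() == "bitsandbytes" and quantization != "bitsandbytes":
        raise ValueError(
            "BitsAndBytes load format and QLoRA adapter only support "
            f"'bitsandbytes' quantization, but got {quantization}"
        )


# The failing combination from the thread: load format set, quantization None.
try:
    check_bnb_consistency("BitsAndBytes", None)
except ValueError as err:
    print(f"Error initializing vLLM engine: {err}")

# The combination the worker config needs to be able to express.
check_bnb_consistency("BitsAndBytes", "bitsandbytes")
```

Since the select list in worker-config.json offers no "bitsandbytes" value, the second call is the case that cannot be configured without patching the worker.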
Mohamed Nagy•10mo ago
GitHub
worker-config.json's QUANTIZATION does not has 'bitsandbytes' opt...
Here is the error: itsAndBytes load format and QLoRA adapter only support 'bitsandbytes' quantization engine.py :26 2025-01-21 11:18:49,619 Engine args: AsyncEngineArgs(model='unsloth/t...
Mohamed Nagy•10mo ago
Yeah, I added a few lines in [https://github.com/mohamednaji7/worker-vllm/tree/main] to complete the option of using bitsandbytes, and this is my merge request: [https://github.com/runpod-workers/worker-vllm/pull/146]
