Runpod•2y ago•

125 replies

Run Mixtral 8x22B Instruct on vLLM worker

Hello everybody, is it possible to run mixtral 8x22B on vLLM worker i tried to run it on the default configuration with 48 gb GPU A6000, A40 but its taking too long, what are the requirements for running mixtral 8x22B successfully ? this is the model that im trying to run https://huggingface.co/mistralai/Mixtral-8x22B-Instruct-v0.1

mistralai/Mixtral-8x22B-Instruct-v0.1 · Hugging Face

mdOP•5/12/24, 8:43 AM

Sorry im new to using GPU's for LLM models

Mmd Hello everybody, is it possible to run mixtral 8x22B on vLLM worker i tried to r...

Jason•5/12/24, 9:48 AM

oh it actually needs a bunch of vram to run

Jason•5/12/24, 9:48 AM

you can try using half for running it in lower vrams

Mmd Sorry im new to using GPU's for LLM models

Jason•5/12/24, 9:48 AM

all good

JJason you can try using half for running it in lower vrams

mdOP•5/12/24, 9:49 AM

thanks for the reply what do you mean by half

Jason•5/12/24, 9:49 AM

in the quantization part

Jason•5/12/24, 9:49 AM

use any to run it in lower vrams

mdOP•5/12/24, 9:49 AM

let me check which GPU would be suitable to run this btw ?

Jason•5/12/24, 9:49 AM

oh wait

Jason•5/12/24, 9:50 AM

the DTYPE i mean*

Jason•5/12/24, 9:50 AM

Jason•5/12/24, 9:50 AM

also this too

Jason•5/12/24, 9:50 AM

Jason•5/12/24, 9:51 AM

try experimenting with those in the env variables of your endpoint

mdOP•5/12/24, 9:51 AM

sure

Jason•5/12/24, 9:51 AM

Environment variables | RunPod Documentation

Environment variables configure your vLLM Worker by providing control over model selection, access credentials, and operational parameters necessary for optimal Worker performance.

mdOP•5/12/24, 9:52 AM

i think this would also be a good option to set right ? since it will divide the memory

Jason•5/12/24, 9:54 AM

never tried vllm worker yet tbh, sure if you want to try it go ahead

Mmd i think this would also be a good option to set right ? since it will divide the...

Jason•5/12/24, 9:55 AM

yeah seems like a good option to try

JJason never tried vllm worker yet tbh, sure if you want to try it go ahead

mdOP•5/12/24, 9:55 AM

ah i see, would there be any substantial decrease in quality if i ran the model in half memory ?

Jason•5/12/24, 9:57 AM

maybe try browsing around on quant, etc

JJason maybe try browsing around on quant, etc

mdOP•5/12/24, 9:59 AM

cool thanks

Jason•5/12/24, 10:04 AM

Because I don't know about them but yeah i think it will if you use the lower dtype

mdOP•5/12/24, 10:09 AM

yeah makes sense

mdOP•5/12/24, 10:36 AM

@nerdylive looks like mixtral 8x2b requires up to 300 gb of vram, and the highest available gpu is 80gb of vram if it uses half amount of memory which would be 150 gb it should be possible to divide 50gb of vram between 3 workers. idk if thats possible do you know somebody from the team that can help me out here ? my company actually wants to deploy this model for our product.

Mmd @nerdylive looks like mixtral 8x2b requires up to 300 gb of vram, and the highes...

Jason•5/12/24, 10:39 AM

Wew where did you get that estimate from

Jason•5/12/24, 10:39 AM

Yeah it's a huge model

JJason Wew where did you get that estimate from

mdOP•5/12/24, 10:39 AM

from the mistral discord

Mmd @nerdylive looks like mixtral 8x2b requires up to 300 gb of vram, and the highes...

Jason•5/12/24, 10:39 AM

Nah, it's not possible yet to divide them onto 3 workers I think

Mmd from the mistral discord

Jason•5/12/24, 10:39 AM

I c

Jason•5/12/24, 10:40 AM

Pods is possible to use multiple gpu

JJason Nah, it's not possible yet to divide them onto 3 workers I think

Jason•5/12/24, 10:40 AM

Or maybe explore accelerate for this ( not sure )

JJason Nah, it's not possible yet to divide them onto 3 workers I think

mdOP•5/12/24, 10:40 AM

i see let me search, though could somebody from the runpod team confirm this ?

Mmd i see let me search, though could somebody from the runpod team confirm this ?

Jason•5/12/24, 10:41 AM

Maybe contact support for that

Jason•5/12/24, 10:42 AM

Like from the website

mdOP•5/12/24, 10:42 AM

ah ok, i misunderstood this is a community server

mdOP•5/12/24, 10:43 AM

sorry

Mmd ah ok, i misunderstood this is a community server

Jason•5/12/24, 10:46 AM

Yeah there are some staffs here but they are easier to manage support request via website support

Jason•5/12/24, 10:46 AM

It's fine

JJason Yeah there are some staffs here but they are easier to manage support request vi...

mdOP•5/12/24, 10:46 AM

Yeah i will contact through official channels thanks for all the help appreciate it

mdOP•5/12/24, 10:47 AM

how do i mark this post as solved ?

Jason•5/12/24, 10:48 AM

Up to you did you find a way to run that model yet?

Jason•5/12/24, 10:48 AM

Without the worker splitting idea

Jason•5/12/24, 10:48 AM

I'd like to know your updates too haha, I think dont mark it first

JJason Up to you did you find a way to run that model yet?

mdOP•5/12/24, 10:50 AM

the only way i have right now is to use a vm with 300gb of vram but it would be costly and not sure if i can find a vm like that, i opted for runpod because it had cheap pricing and easy deployments

JJason I'd like to know your updates too haha, I think dont mark it first

mdOP•5/12/24, 10:50 AM

sure i will post updates here

JJason Without the worker splitting idea

mdOP•5/12/24, 10:51 AM

one guy in mistral discord also wanted to split memory in order to run the model between 4x gpus

mdOP•5/12/24, 10:52 AM

they suggested vLLM for this which is what runpod workers are using i think

Jason•5/12/24, 10:52 AM

Vllm supports that?

mdOP•5/12/24, 10:53 AM

i havent looked into it yet but they suggested it and TGI

Run Mixtral 8x22B Instruct on vLLM worker

Similar Threads

Similar Threads

Similar Threads