Runpod•2y ago
md

Run Mixtral 8x22B Instruct on vLLM worker

Hello everybody, is it possible to run Mixtral 8x22B on the vLLM worker? I tried to run it on the default configuration with 48 GB GPUs (A6000, A40), but it's taking too long. What are the requirements for running Mixtral 8x22B successfully? This is the model I'm trying to run: https://huggingface.co/mistralai/Mixtral-8x22B-Instruct-v0.1
77 Replies
md
mdOP•2y ago
Sorry, I'm new to using GPUs for LLM models.
Unknown User
Unknown User•2y ago
Message Not Public
md
mdOP•2y ago
Thanks for the reply. What do you mean by half?
Unknown User
Unknown User•2y ago
Message Not Public
md
mdOP•2y ago
Let me check. Btw, which GPU would be suitable to run this?
Unknown User
Unknown User•2y ago
Message Not Public
md
mdOP•2y ago
sure
Unknown User
Unknown User•2y ago
Message Not Public
md
mdOP•2y ago
I think this would also be a good option to set, right? Since it will divide the memory.
(image attached)
Unknown User
Unknown User•2y ago
Message Not Public
md
mdOP•2y ago
Ah, I see. Would there be any substantial decrease in quality if I ran the model with half the memory?
Unknown User
Unknown User•2y ago
Message Not Public
md
mdOP•2y ago
Cool, thanks
Unknown User
Unknown User•2y ago
Message Not Public
md
mdOP•2y ago
Yeah, makes sense @nerdylive. It looks like Mixtral 8x22B requires up to 300 GB of VRAM, and the highest available GPU has 80 GB of VRAM. If it uses half the memory, which would be 150 GB, it should be possible to divide that across 3 workers with 50 GB of VRAM each. I don't know if that's possible. Do you know somebody from the team who can help me out here? My company actually wants to deploy this model for our product.
Unknown User
Unknown User•2y ago
Message Not Public
md
mdOP•2y ago
From the Mistral Discord.
Unknown User
Unknown User•2y ago
Message Not Public
md
mdOP•2y ago
I see, let me search. Though, could somebody from the RunPod team confirm this?
Unknown User
Unknown User•2y ago
Message Not Public
md
mdOP•2y ago
Ah ok, I misunderstood, this is a community server. Sorry.
Unknown User
Unknown User•2y ago
Message Not Public
md
mdOP•2y ago
Yeah, I will contact them through official channels. Thanks for all the help, appreciate it. How do I mark this post as solved?
Unknown User
Unknown User•2y ago
Message Not Public
md
mdOP•2y ago
The only way I have right now is to use a VM with 300 GB of VRAM, but it would be costly and I'm not sure I can find a VM like that; I opted for RunPod because it had cheap pricing and easy deployments. Sure, I will post updates here. One guy in the Mistral Discord also wanted to split memory in order to run the model across 4x GPUs; they suggested vLLM for this, which is what RunPod workers are using, I think.
Unknown User
Unknown User•2y ago
Message Not Public
md
mdOP•2y ago
I haven't looked into it yet, but they suggested it and TGI.
Unknown User
Unknown User•2y ago
Message Not Public
md
mdOP•2y ago
yeah
Unknown User
Unknown User•2y ago
Message Not Public
md
mdOP•2y ago
@nerdylive this is what I got:
(image attached)
md
mdOP•2y ago
https://docs.mistral.ai/deployment/self-deployment/vllm/ In this guide they set the tensor parallel size to 4. I wonder if RunPod does it as well.
vLLM | Mistral AI Large Language Models
vLLM can be deployed using a docker image we provide, or directly from the python package.
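For reference, a minimal sketch of what that guide describes, using vLLM's Python API directly; this is an illustration under the assumption of a single machine (or worker) with 4 GPUs, not RunPod's worker code, and the prompt and sampling values are made up:

```python
# Minimal sketch: serving Mixtral 8x22B with vLLM, sharding the weights
# across 4 GPUs via tensor parallelism as in the Mistral deployment guide.
from vllm import LLM, SamplingParams

llm = LLM(
    model="mistralai/Mixtral-8x22B-Instruct-v0.1",
    tensor_parallel_size=4,  # split the model across 4 GPUs
    dtype="half",            # fp16 weights, ~2 bytes per parameter
)

outputs = llm.generate(
    ["Explain tensor parallelism in one sentence."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```

On RunPod's serverless vLLM worker, the corresponding knob is the tensor parallel size setting that comes up later in the thread.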
Unknown User
Unknown User•2y ago
Message Not Public
md
mdOP•2y ago
let me check
md
mdOP•2y ago
Oh, this was the option lol
(image attached)
md
mdOP•2y ago
I set it to 3 but still ran out of memory. I used 2 GPUs per worker as well, actually 80 GB ones.
Unknown User
Unknown User•2y ago
Message Not Public
md
mdOP•2y ago
No, I ran out of memory even with the above config.
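One note on the attempt above (a sanity-check sketch, not RunPod's worker internals): the tensor parallel size cannot exceed the number of GPUs a single worker actually sees, so 3 on a 2-GPU worker cannot work, and even a matching value does not help when the worker's total VRAM is below what the weights need.

```python
# Sanity check: tensor parallel degree vs. GPUs visible to one worker
# (values are illustrative).
import torch

gpus_per_worker = torch.cuda.device_count()  # e.g. 2 with "2 GPUs per worker"
tensor_parallel_size = 3                     # the value tried above

assert tensor_parallel_size <= gpus_per_worker, (
    f"tensor_parallel_size={tensor_parallel_size}, but only "
    f"{gpus_per_worker} GPU(s) are visible to this worker"
)
```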
TF
TF•2y ago
I'm trying to do the same thing as you right now lol. Will update if I figure something out.
md
mdOP•2y ago
Thanks a lot
Madiator2011
Madiator2011•2y ago
@Alpay Ariyak maybe you could help here 🙂
Alpay Ariyak
Alpay Ariyak•2y ago
Hi, you need at least 2x 80 GB GPUs afaik
md
mdOP•2y ago
Hey, yes, I used 2x 80 GB GPUs per worker with 3 workers, but I got a torch.cuda out-of-memory error while trying to allocate.
Unknown User
Unknown User•2y ago
Message Not Public
md
mdOP•2y ago
Yeah, I will try soon. I just selected the option of 2 GPUs per worker with the 80 GB H100.
Bryan
Bryan•2y ago
Oh? I can only do 2 GPUs per worker with 48GB GPUs, not 80GB GPUs. Are you sure?
(image attached)
Bryan
Bryan•2y ago
Unless you're doing a pod instead of serverless, in which case ignore me 🙂
Alpay Ariyak
Alpay Ariyak•2y ago
My apologies, you actually need 4x 80 GB for 8x22B
Unknown User
Unknown User•2y ago
Message Not Public
Alpay Ariyak
Alpay Ariyak•2y ago
Not with the current limits, no
Unknown User
Unknown User•2y ago
Message Not Public
Alpay Ariyak
Alpay Ariyak•2y ago
With OpenAI compatibility?
Unknown User
Unknown User•2y ago
Message Not Public
Alpay Ariyak
Alpay Ariyak•2y ago
Not SSE, a regular GET request. It will return the outputs yielded by the worker since the last /stream call.
Unknown User
Unknown User•2y ago
Message Not Public
Alpay Ariyak
Alpay Ariyak•2y ago
What goal do you have in mind?
Unknown User
Unknown User•2y ago
Message Not Public
Alpay Ariyak
Alpay Ariyak•2y ago
OpenAI compatibility streaming is through SSE
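A rough sketch of the two styles being contrasted here, assuming a serverless endpoint running the vLLM worker. The endpoint ID, API key, and prompt are placeholders, and the URL paths are assumptions based on RunPod's serverless and OpenAI-compatibility docs as best recalled, so verify them against the current documentation:

```python
# Sketch of the two streaming styles discussed above. ENDPOINT_ID and API_KEY
# are placeholders; URL paths are assumptions, check RunPod's docs.
import requests
from openai import OpenAI

ENDPOINT_ID = "your-endpoint-id"
API_KEY = "your-runpod-api-key"
HEADERS = {"Authorization": f"Bearer {API_KEY}"}
BASE = f"https://api.runpod.ai/v2/{ENDPOINT_ID}"

# 1) Polled /stream: submit a job, then issue plain GET requests; each call
#    returns whatever the worker has yielded since the previous /stream call.
job = requests.post(f"{BASE}/run", headers=HEADERS,
                    json={"input": {"prompt": "Hello"}}).json()
chunk = requests.get(f"{BASE}/stream/{job['id']}", headers=HEADERS).json()
print(chunk)

# 2) OpenAI-compatible streaming: delivered to the client as server-sent events.
client = OpenAI(api_key=API_KEY, base_url=f"{BASE}/openai/v1")
stream = client.chat.completions.create(
    model="mistralai/Mixtral-8x22B-Instruct-v0.1",
    messages=[{"role": "user", "content": "Hello"}],
    stream=True,
)
for event in stream:
    print(event.choices[0].delta.content or "", end="")
```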
richterscale9
richterscale9•2y ago
Hey, sorry to hijack the thread, I'm also looking into deploying vLLM on RunPod serverless. The landing page indicates that it should be possible to bring your own container, not pay for any idle time, and have <250ms cold boot. Is this true? It sounds too good to be true.
Unknown User
Unknown User•2y ago
Message Not Public
Alpay Ariyak
Alpay Ariyak•2y ago
Yes, through FlashBoot. That one is strictly polled.
Unknown User
Unknown User•2y ago
Message Not Public
richterscale9
richterscale9•2y ago
Does this 250ms cold boot time really include everything? Or does it only contain some things, such that the actual cold boot time might be 30 seconds or something? For example, the time to load LLM weights into memory typically takes more than 10 seconds.
Alpay Ariyak
Alpay Ariyak•2y ago
Everything, due to not needing to reload weights
richterscale9
richterscale9•2y ago
That's just insane if it really works
Alpay Ariyak
Alpay Ariyak•2y ago
Haha try it out!
richterscale9
richterscale9•2y ago
Yeah, reading the docs right now to figure out everything I need to do to try it... I currently have a Docker image that spins up a fork of the oobabooga web UI; I'm thinking about setting that up for the serverless experiment.
md
mdOP•2y ago
Yeah, you're actually right, I confused it with 80 GB, my bad guys. Even with using dtype half we need 4x 80 GB?
Bryan
Bryan•2y ago
8x22B = 176B parameters. At 16-bit, 2 bytes per parameter, that's 352 GB just for the model parameters. At 8-bit (1 byte per parameter) it's still 176 GB. I could be mistaken about this, I'm not an expert on this for sure, but my understanding is that you can just fit 8x22B on 4x 80 GB with 8-bit quantization.
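To make that arithmetic concrete, a back-of-the-envelope estimate of weight memory only; KV cache and runtime overhead come on top, and the naive 8 x 22B count slightly overstates the real total (the experts share the attention weights), but it works as an upper bound:

```python
# Back-of-the-envelope weight memory for "8x22B" at different precisions
# (weights only; KV cache and runtime overhead are extra).
params_billion = 8 * 22  # naive upper-bound count used above: 176B parameters

for bits, name in [(16, "fp16/bf16"), (8, "int8"), (4, "int4")]:
    approx_gb = params_billion * bits / 8  # billions of params * bytes per param
    print(f"{name:9s} ~{approx_gb:.0f} GB")
# fp16/bf16 ~352 GB, int8 ~176 GB, int4 ~88 GB
```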
md
mdOP•2y ago
I see, yeah that makes sense. I will revisit this in the future.
Alpay Ariyak
Alpay Ariyak•2y ago
We're raising the serverless GPU count limits around next week I believe, even up to 10x A40 per worker.
Unknown User
Unknown User•2y ago
Message Not Public
Alpay Ariyak
Alpay Ariyak•2y ago
Yes, 2x of everything at the very least iirc
Unknown User
Unknown User•2y ago
Message Not Public
md
mdOP•2y ago
Nice, this will be useful. Thanks a lot.
Unknown User
Unknown User•2y ago
Message Not Public
