RunPod•4mo ago
Builderman

Mixtral Possible?

Wondering if it's possible to run AWQ mixtral on serverless with good speed
16 Replies
wizardjoe
wizardjoe•4mo ago
I'm currently running this with decent speeds, but you'll need to set your min and max workers accordingly depending on the load you expect
Builderman
Builderman•4mo ago
what GPU do you use?
wizardjoe
wizardjoe•4mo ago
I have min workers set to at least 1 so that it doesn't spend time booting, which is where the majority of the latency will be. I use 48GB; don't select anything under that.
Builderman
Builderman•4mo ago
kk thanks
interesting_friend_5
interesting_friend_5•4mo ago
I have been trying to run Mixtral AWQ but am not getting any results returned in the completed message. I had no trouble with Llama 2, but am struggling to get Mixtral working. Anyone else have this issue?
justin
justin•4mo ago
What repository are you running it with? Just wondering, a custom repo or?
interesting_friend_5
interesting_friend_5•4mo ago
GitHub - runpod-workers/worker-vllm: The RunPod worker template for serving our large language model endpoints. Powered by vLLM.
justin
justin•4mo ago
Ah, a great person to ask would be @Alpay Ariyak then! 🙂 I'll ping him into this thread so you can ask him more questions. He is RunPod staff who is familiar with it, and it seems he's the main one working on the vLLM worker.
interesting_friend_5
interesting_friend_5•4mo ago
Awesome! Thank you!
justin
justin•4mo ago
Just as a question, do you want to share your build command + also the input that you are sending? Might be helpful for debugging when he does get a chance to take a look. Or whatever steps + what you are getting.
interesting_friend_5
interesting_friend_5•4mo ago
I've set the environment variables MODEL_NAME=TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ and QUANTIZATION=awq. I've got no other custom commands.
justin
justin•4mo ago
What input are you sending in? And I think that's great. Hopefully Alpay will be able to ping in then 🙂 since he is the most knowledgeable on that repo.
interesting_friend_5
interesting_friend_5•4mo ago
```python
prompt = "Tell me about AI"
prompt_template = f'''[INST] {prompt} [/INST]
'''
prompt = prompt_template.format(prompt=prompt)

payload = {
    "input": {
        "prompt": prompt,
        "sampling_params": {
            "max_tokens": 1000,
            "n": 1,
            "presence_penalty": 0.2,
            "frequency_penalty": 0.7,
            "temperature": 1.0,
        }
    }
}
```
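For context, a minimal sketch of how a payload like this could be sent to a RunPod serverless endpoint via the `/runsync` route. The endpoint ID and `RUNPOD_API_KEY` environment variable are placeholders you'd fill in with your own values:

```python
import os

def build_payload(prompt: str) -> dict:
    # Wrap the raw prompt in Mixtral-Instruct's [INST] template,
    # mirroring the sampling params used in the message above.
    return {
        "input": {
            "prompt": f"[INST] {prompt} [/INST]",
            "sampling_params": {
                "max_tokens": 1000,
                "n": 1,
                "presence_penalty": 0.2,
                "frequency_penalty": 0.7,
                "temperature": 1.0,
            },
        }
    }

def run_sync(endpoint_id: str, prompt: str) -> dict:
    # Blocking request; the API key goes in the Authorization header.
    import requests  # third-party dependency, imported lazily here

    resp = requests.post(
        f"https://api.runpod.ai/v2/{endpoint_id}/runsync",
        headers={"Authorization": f"Bearer {os.environ['RUNPOD_API_KEY']}"},
        json=build_payload(prompt),
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()
```

If the worker returns nothing in the completed message, inspecting the raw JSON from `run_sync` (and the endpoint logs) is the quickest way to see where the output is getting dropped.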
Alpay Ariyak
Alpay Ariyak•4mo ago
Hi, what do the logs show? One suggestion I've seen with quants is turning trust remote code on, which can be done by setting TRUST_REMOTE_CODE to 1. Could you share the actual job outputs as well?
ashleyk
ashleyk•4mo ago
I don't know about the quantized models but even the non-quantized Mixtral model requires trust_remote_code to be enabled.
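Putting the suggestions from this thread together, the endpoint's environment variables (as set in the RunPod template for the worker-vllm image) would look something like the following. This is a sketch based on the values mentioned above, not an official config:

```shell
MODEL_NAME=TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ
QUANTIZATION=awq
TRUST_REMOTE_CODE=1
```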
Alpay Ariyak
Alpay Ariyak•4mo ago
That’s good to know, thanks for pointing that out