Workers configuration for Serverless vLLM endpoints: 1-hour lecture with 50 students
Hey there, I need to show 50 students how to do RAG with open-source LLMs (e.g., Llama 3). What kind of configuration do you suggest? I want to make sure they have a smooth experience. Thanks!

11 Replies
Depends on which Llama 3 model.
For 70B non-quantized you would need at least 2x80GB of VRAM.
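If you want a quick sanity check on those numbers, here's a back-of-envelope estimate in Python (the 1.2x overhead factor for KV cache and runtime is my assumption, not an exact figure):

```python
def estimate_vram_gb(params_billions: float, bytes_per_param: float, overhead: float = 1.2) -> float:
    """Rough VRAM estimate: model weights plus a fudge factor
    for KV cache, activations, and runtime overhead."""
    return params_billions * bytes_per_param * overhead

# Llama 3 70B in fp16 (2 bytes/param): ~168 GB -> at least 2x80GB GPUs
print(estimate_vram_gb(70, 2))  # 168.0
# Llama 3 8B in fp16: ~19 GB -> fits a 24GB card, tight on 16GB
print(estimate_vram_gb(8, 2))   # 19.2
```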
Pods are expensive
An 8B-parameter model can also suffice.
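For reference, a minimal sketch of running Llama 3 8B with vLLM's offline API (the model ID is assumed; the official Meta repo is gated on Hugging Face, so you'd need an access token):

```python
from vllm import LLM, SamplingParams

# Assumed model ID; requires an HF token with access to the gated repo.
llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["What is retrieval-augmented generation?"], params)
print(outputs[0].outputs[0].text)
```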
Solution
16GB isn't enough; you need 24GB.
Unless you use a quantized version.
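If you go the quantized route, here's a sketch of loading a 4-bit AWQ build in vLLM (the model ID is an assumption; swap in whichever quantized checkpoint you actually use):

```python
from vllm import LLM

# Assumed community AWQ build; 4-bit weights cut the 8B model
# from ~16GB (fp16) to roughly 5-6GB, so it fits on a 16GB GPU.
llm = LLM(
    model="casperhansen/llama-3-8b-instruct-awq",  # assumption
    quantization="awq",
)
```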
You can also use this model if you want it uncensored:
https://huggingface.co/cognitivecomputations/dolphin-2.9.1-llama-3-8b