RunPod • 4mo ago
ribbit

LLM inference on serverless solution

Hi, I need some suggestions on serving an LLM on serverless. I have a few questions: 1. Is there a guide or example project I can follow to run inference effectively on RunPod serverless? 2. Is it recommended to use a framework like TGI or vLLM with RunPod? If so, why? I'd like maximum control over the inference code, so I haven't tried any of those frameworks yet. Thanks!
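For reference, here is a minimal sketch of the "maximum control" route: a custom RunPod serverless handler using the `runpod` Python SDK and Hugging Face `transformers`. The model name is a placeholder, and a quantized backend such as ExLlama would need its own loading code in place of `AutoModelForCausalLM`.

```python
# Minimal sketch of a custom RunPod serverless handler, assuming the
# `runpod` Python SDK and Hugging Face `transformers`.
# The model ID below is a placeholder; swap in your own model.
import runpod
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "mistralai/Mistral-7B-Instruct-v0.2"  # placeholder model

# Load once at import time so warm requests reuse the weights
# instead of paying the load cost on every invocation.
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

def handler(job):
    """Receives {"input": {"prompt": ..., "max_new_tokens": ...}}."""
    job_input = job["input"]
    prompt = job_input["prompt"]
    max_new_tokens = job_input.get("max_new_tokens", 256)

    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    return {"output": text}

runpod.serverless.start({"handler": handler})
```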
7 Replies
ashleyk • 4mo ago
RunPod have created a vLLM worker that you can use for serverless: https://github.com/runpod-workers/worker-vllm
GitHub: runpod-workers/worker-vllm. The RunPod worker template for serving our large language model endpoints. Powered by vLLM.
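Once a worker-vllm endpoint is deployed, calling it is a POST to RunPod's endpoint API. A rough sketch; the endpoint ID, API key, and exact input fields below are placeholders, so check the worker-vllm README for the current schema.

```python
# Rough sketch of calling a deployed worker-vllm endpoint over
# RunPod's HTTP API. Endpoint ID and API key are placeholders.
import requests

ENDPOINT_ID = "your-endpoint-id"   # placeholder
API_KEY = "your-runpod-api-key"    # placeholder

resp = requests.post(
    f"https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "input": {
            "prompt": "Explain serverless GPU inference in one sentence.",
            "sampling_params": {"max_tokens": 128, "temperature": 0.7},
        }
    },
    timeout=300,
)
print(resp.json())
```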
ribbit • 4mo ago
Thanks! But I heard that vLLM does not support quantized models? One of the reasons I want maximum control over the inference code is that I want to run quantized models with libraries other than Transformers (ExLlama, etc.).
ashleyk • 4mo ago
It does support some quantization types
ashleyk • 4mo ago
[Attached screenshot: vLLM's supported quantization types]
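As an illustration, vLLM's own Python API can load quantized checkpoints directly; a minimal sketch, assuming an AWQ model (the model ID is a placeholder, and the accepted `quantization` values depend on your vLLM version):

```python
# Sketch: loading an AWQ-quantized model with vLLM's Python API.
# The model ID is a placeholder; supported quantization values
# (e.g. "awq", "gptq") depend on the installed vLLM version.
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/Mistral-7B-Instruct-v0.2-AWQ",  # placeholder AWQ model
    quantization="awq",
)
params = SamplingParams(max_tokens=128, temperature=0.7)
outputs = llm.generate(["What is continuous batching?"], params)
print(outputs[0].outputs[0].text)
```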
ribbit • 4mo ago
I see, it seems like ExLlama is not supported yet. What are the real advantages of using vLLM, though?
ashleyk • 4mo ago
concurrency
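That is, vLLM continuously batches in-flight requests on a single worker, while a plain `model.generate()` handler serves one request at a time. A toy client-side illustration (endpoint ID and API key are placeholders, as above):

```python
# Toy illustration of the concurrency point: fire several requests
# at once and let the vLLM worker batch them together.
# Endpoint ID and API key are placeholders.
from concurrent.futures import ThreadPoolExecutor
import requests

ENDPOINT_ID = "your-endpoint-id"   # placeholder
API_KEY = "your-runpod-api-key"    # placeholder
URL = f"https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync"
HEADERS = {"Authorization": f"Bearer {API_KEY}"}

def ask(prompt):
    payload = {"input": {"prompt": prompt, "sampling_params": {"max_tokens": 64}}}
    return requests.post(URL, headers=HEADERS, json=payload, timeout=300).json()

prompts = [f"Summarize topic {i} in one line." for i in range(8)]
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(ask, prompts))

for r in results:
    print(r)
```

Under load, a batching engine like vLLM should keep throughput roughly flat as concurrency grows, which is hard to match with a hand-rolled single-request handler.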
ribbit • 4mo ago
I see, I'll explore that more, thanks!