RunPod • 4mo ago
ribbit

LLM inference on serverless solution

Hi, I need some suggestions on serving an LLM on serverless. I have a few questions: 1. Is there a guide or example project I can follow to run inference effectively on RunPod serverless? 2. Is it recommended to use a framework like TGI or vLLM with RunPod? If so, why? I'd like maximum control over the inference code, so I haven't tried any of those frameworks yet. Thanks!
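For reference, here is a minimal sketch of the "maximum control" route: a custom RunPod serverless handler using the `runpod` Python SDK and Hugging Face `transformers`. The model name is a placeholder, and a quantized backend such as ExLlama would need its own loading code in place of `AutoModelForCausalLM`.

```python
# Minimal sketch of a custom RunPod serverless handler, assuming the
# `runpod` Python SDK and Hugging Face `transformers`.
# The model ID below is a placeholder; swap in your own model.
import runpod
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "mistralai/Mistral-7B-Instruct-v0.2"  # placeholder model

# Load once at import time so warm requests reuse the weights
# instead of paying the load cost on every invocation.
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

def handler(job):
    """Receives {"input": {"prompt": ..., "max_new_tokens": ...}}."""
    job_input = job["input"]
    prompt = job_input["prompt"]
    max_new_tokens = job_input.get("max_new_tokens", 256)

    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    return {"output": text}

runpod.serverless.start({"handler": handler})
```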
7 Replies
ashleyk • 4mo ago
RunPod have created a vLLM worker that you can use for serverless: https://github.com/runpod-workers/worker-vllm
GitHub: runpod-workers/worker-vllm. The RunPod worker template for serving our large language model endpoints. Powered by vLLM.
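Once a worker-vllm endpoint is deployed, calling it is a POST to RunPod's endpoint API. A rough sketch; the endpoint ID, API key, and exact input fields below are placeholders, so check the worker-vllm README for the current schema.

```python
# Rough sketch of calling a deployed worker-vllm endpoint over
# RunPod's HTTP API. Endpoint ID and API key are placeholders.
import requests

ENDPOINT_ID = "your-endpoint-id"   # placeholder
API_KEY = "your-runpod-api-key"    # placeholder

resp = requests.post(
    f"https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "input": {
            "prompt": "Explain serverless GPU inference in one sentence.",
            "sampling_params": {"max_tokens": 128, "temperature": 0.7},
        }
    },
    timeout=300,
)
print(resp.json())
```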
ribbit • 4mo ago
Thanks! But I heard that vLLM does not support quantized models? One of the reasons I want maximum control over the inference code is that I want to run quantized models with libraries other than Transformers (ExLlama, etc.).
ashleyk • 4mo ago
It does support some quantization types
ashleyk • 4mo ago
[Attached screenshot: vLLM's supported quantization types]
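As an illustration, vLLM's own Python API can load quantized checkpoints directly; a minimal sketch, assuming an AWQ model (the model ID is a placeholder, and the accepted `quantization` values depend on your vLLM version):

```python
# Sketch: loading an AWQ-quantized model with vLLM's Python API.
# The model ID is a placeholder; supported quantization values
# (e.g. "awq", "gptq") depend on the installed vLLM version.
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/Mistral-7B-Instruct-v0.2-AWQ",  # placeholder AWQ model
    quantization="awq",
)
params = SamplingParams(max_tokens=128, temperature=0.7)
outputs = llm.generate(["What is continuous batching?"], params)
print(outputs[0].outputs[0].text)
```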
ribbit • 4mo ago
I see, it seems like ExLlama is not supported yet. What are the real advantages of using vLLM, though?
ashleyk • 4mo ago
concurrency
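That is, vLLM continuously batches in-flight requests on a single worker, while a plain `model.generate()` handler serves one request at a time. A toy client-side illustration (endpoint ID and API key are placeholders, as above):

```python
# Toy illustration of the concurrency point: fire several requests
# at once and let the vLLM worker batch them together.
# Endpoint ID and API key are placeholders.
from concurrent.futures import ThreadPoolExecutor
import requests

ENDPOINT_ID = "your-endpoint-id"   # placeholder
API_KEY = "your-runpod-api-key"    # placeholder
URL = f"https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync"
HEADERS = {"Authorization": f"Bearer {API_KEY}"}

def ask(prompt):
    payload = {"input": {"prompt": prompt, "sampling_params": {"max_tokens": 64}}}
    return requests.post(URL, headers=HEADERS, json=payload, timeout=300).json()

prompts = [f"Summarize topic {i} in one line." for i in range(8)]
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(ask, prompts))

for r in results:
    print(r)
```

Under load, a batching engine like vLLM should keep throughput roughly flat as concurrency grows, which is hard to match with a hand-rolled single-request handler.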
ribbit • 4mo ago
I see, I'll explore that more, thanks!