How to queue requests to vLLM pods?
Hi there, I run an AI chat site (https://www.hammerai.com) with ~100k users.
I was previously using vLLM serverless, but switched over to using dedicated Pods with the vLLM template (Container Image: vllm/vllm-openai:latest) because serverless was getting very expensive.

Currently I have three pods spun up and a Next.js API which uses the Vercel ai SDK to call one of the three pods (I just choose one of the three randomly; a rough sketch of this setup is below the questions). This works okay as a fake load balancer, but sometimes all the pods are busy and the request fails with an error.

A few questions:
- Is there any suggested way to handle queueing requests?
- Is there any suggested way to distribute requests between pods?
- Are there any nice libraries or example projects which show how to do this?
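
For reference, here is a minimal sketch of the current "pick a pod at random" setup described above. The pod URLs, model name, and env variable are placeholders, and the exact method names vary a bit between ai SDK versions:

```ts
// app/api/chat/route.ts — minimal sketch of the current setup.
// Each pod runs vllm/vllm-openai, which exposes an OpenAI-compatible API.
import { createOpenAI } from '@ai-sdk/openai';
import { streamText } from 'ai';

// Placeholder pod endpoints — replace with your actual pod URLs.
const POD_URLS = [
  'https://<pod-id-1>-8000.proxy.runpod.net/v1',
  'https://<pod-id-2>-8000.proxy.runpod.net/v1',
  'https://<pod-id-3>-8000.proxy.runpod.net/v1',
];

export async function POST(req: Request) {
  const { messages } = await req.json();

  // "Fake load balancer": choose one of the three pods at random.
  const baseURL = POD_URLS[Math.floor(Math.random() * POD_URLS.length)];

  const vllm = createOpenAI({
    baseURL,
    // vLLM only checks the key if the server was started with --api-key.
    apiKey: process.env.VLLM_API_KEY ?? 'not-needed',
  });

  const result = await streamText({
    model: vllm('<model-name>'), // the model id the pod is serving
    messages,
  });

  // If one pod is saturated, this request fails rather than being queued
  // or retried against another pod — hence the questions above.
  return result.toDataStreamResponse();
}
```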