Runpod · 10mo ago

How to queue requests to vLLM pods?

Hi there, I run an AI chat site (https://www.hammerai.com) with ~100k users. I was previously using vLLM serverless, but switched over to dedicated Pods with the vLLM template (Container Image: vllm/vllm-openai:latest) because serverless was getting very expensive. Currently I have three pods spun up and a Next.js API which uses the Vercel AI SDK to call one of the three pods (I just pick one of the three at random). This works okay as a fake load balancer, but sometimes the pods are all busy and I fail with:
Error RetryError [AI_RetryError]: Failed after 3 attempts. Last error: Bad Gateway
at _retryWithExponentialBackoff (/var/task/apps/web/.next/server/chunks/8499.js:5672:19)
at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
at async startStep (/var/task/apps/web/.next/server/chunks/8499.js:9353:171)
at async fn (/var/task/apps/web/.next/server/chunks/8499.js:9427:99)
at async /var/task/apps/web/.next/server/chunks/8499.js:5808:28
at async POST (/var/task/apps/web/.next/server/app/api/cloud/chat/route.js:238:26)
at async /var/task/apps/web/.next/server/chunks/9854.js:5600:37 {
cause: undefined,
reason: 'maxRetriesExceeded',
errors: [
APICallError [AI_APICallError]: Bad Gateway
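For context on the distribution question, the random pick can be replaced with round-robin plus failover: try pods in rotating order and fall through to the next pod when one returns a gateway error. A minimal sketch in TypeScript, assuming the pods expose the usual OpenAI-compatible endpoint; the pod URLs and the `podFetch`/`podOrder` names are illustrative, not from any SDK:

```typescript
// Illustrative pod endpoints; replace with your actual pod URLs.
const PODS = [
  "https://pod-a.example.net/v1",
  "https://pod-b.example.net/v1",
  "https://pod-c.example.net/v1",
];

let cursor = 0; // rotates so load spreads across pods between requests

// Pure helper: the order in which to try pods, starting at `start`.
export function podOrder(start: number, n: number): number[] {
  return Array.from({ length: n }, (_, i) => (start + i) % n);
}

// Try each pod in round-robin order; fall through to the next pod on a
// 502/503 or a network-level failure instead of surfacing Bad Gateway.
export async function podFetch(
  path: string,
  init?: RequestInit
): Promise<Response> {
  let lastError: unknown = new Error("no pods configured");
  for (const idx of podOrder(cursor, PODS.length)) {
    try {
      const res = await fetch(PODS[idx] + path, init);
      if (res.status !== 502 && res.status !== 503) {
        cursor = (idx + 1) % PODS.length; // next request starts at the next pod
        return res; // success, or an error that retrying elsewhere won't fix
      }
      lastError = new Error(`pod ${PODS[idx]} returned ${res.status}`);
    } catch (err) {
      lastError = err; // network failure: try the next pod
    }
  }
  throw lastError; // every pod was busy or unreachable
}
```

This only fails the request after all three pods have been tried, instead of after three retries against the same randomly chosen pod.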
A few questions:
1. Is there any suggested way to handle queueing requests?
2. Is there any suggested way to distribute requests between pods?
3. Are there any nice libraries or example projects which show how to do this?
Thank you for any help!
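On the queueing side, a small in-process concurrency cap can hold excess requests until a slot frees up, rather than letting bursts hit already-busy pods. A minimal sketch, assuming all traffic flows through one Node process; the `RequestQueue` class and its `limit` are made-up names for illustration (libraries such as p-limit or p-queue provide the same idea off the shelf):

```typescript
// Cap the number of in-flight requests; callers past the limit wait in
// FIFO order until a running request completes.
export class RequestQueue {
  private active = 0;
  private waiting: Array<() => void> = [];

  constructor(private limit: number) {}

  async run<T>(task: () => Promise<T>): Promise<T> {
    while (this.active >= this.limit) {
      // Park until a finishing task wakes us; re-check in case the
      // freed slot was taken by a caller that arrived in between.
      await new Promise<void>((resolve) => this.waiting.push(resolve));
    }
    this.active++;
    try {
      return await task();
    } finally {
      this.active--;
      this.waiting.shift()?.(); // wake the next queued caller, if any
    }
  }
}
```

Usage would look like `const queue = new RequestQueue(8); const res = await queue.run(() => fetch(podUrl, init));`. Note this queue lives in one process, so it caps concurrency per Next.js instance, not globally across serverless invocations.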
1 Reply