How to queue requests to vLLM pods?
Hi there, I run an AI chat site (https://www.hammerai.com) with ~100k users.
I was previously using vLLM serverless, but switched over to using dedicated Pods with the vLLM template (Container Image: vllm/vllm-openai:latest) because serverless was getting very expensive.

Currently I have three pods spun up and a Next.js API which uses the Vercel ai SDK to call one of the three pods (I just choose one of the three randomly; a rough sketch of this setup is below the questions). This works okay as a fake load balancer, but sometimes all the pods are busy and the request fails with an error.

A few questions:
- Is there any suggested way to handle queueing requests?
- Is there any suggested way to distribute requests between pods?
- Are there any nice libraries or example projects which show how to do this?
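
For reference, here is a minimal sketch of the current "pick a pod at random" setup described above. The pod URLs, model name, and env variable are placeholders, and the exact method names vary a bit between ai SDK versions:

```ts
// app/api/chat/route.ts — minimal sketch of the current setup.
// Each pod runs vllm/vllm-openai, which exposes an OpenAI-compatible API.
import { createOpenAI } from '@ai-sdk/openai';
import { streamText } from 'ai';

// Placeholder pod endpoints — replace with your actual pod URLs.
const POD_URLS = [
  'https://<pod-id-1>-8000.proxy.runpod.net/v1',
  'https://<pod-id-2>-8000.proxy.runpod.net/v1',
  'https://<pod-id-3>-8000.proxy.runpod.net/v1',
];

export async function POST(req: Request) {
  const { messages } = await req.json();

  // "Fake load balancer": choose one of the three pods at random.
  const baseURL = POD_URLS[Math.floor(Math.random() * POD_URLS.length)];

  const vllm = createOpenAI({
    baseURL,
    // vLLM only checks the key if the server was started with --api-key.
    apiKey: process.env.VLLM_API_KEY ?? 'not-needed',
  });

  const result = await streamText({
    model: vllm('<model-name>'), // the model id the pod is serving
    messages,
  });

  // If one pod is saturated, this request fails rather than being queued
  // or retried against another pod — hence the questions above.
  return result.toDataStreamResponse();
}
```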