vLLM Dynamic Batching
Hi, I currently use a locally hosted exl2 setup but want to migrate my inference to RunPod serverless. My use case requires processing hundreds, sometimes thousands, of prompts at the same time. I'm currently taking advantage of exl2's dynamic batching to figure out the optimal collating for batch processing. Does the vLLM backend support taking in thousands of prompts (some of which could be close to 4096 tokens long) through the OpenAI API, processing them as a job, and returning the results as a batch?
I'm not sure specifically about vLLM, but this sounds like a job for serverless, and you can also define concurrency
https://docs.runpod.io/serverless/workers/concurrent-handler
So your worker is able to accept, let's say, 5 requests before more workers spin up to accept more, and they'll keep pulling requests off the queue as they finish
In this instance, these 5 requests, would they just essentially be processed sequentially? Or would they be processed concurrently?
If a worker is using a single GPU, I'm not quite sure how the requests could be processed in parallel without the backend (like vLLM) handling it
Concurrently if you are using something similar to:
https://docs.runpod.io/serverless/workers/concurrent-handler
An example from when I played with it before:
They're using asyncio to manage the async requests
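Roughly like this, as a minimal sketch based on that doc page; the fixed limit of 5 and the placeholder inference step are just examples:
```python
import asyncio

import runpod


async def handler(job):
    prompt = job["input"].get("prompt", "")
    # Placeholder for the real async inference call (vLLM, exllamav2, etc.)
    await asyncio.sleep(0.1)
    return {"output": f"processed: {prompt[:32]}"}


def concurrency_modifier(current_concurrency: int) -> int:
    # Let this worker take up to 5 jobs at once; extra jobs stay in the
    # RunPod queue and can spin up more workers.
    return 5


runpod.serverless.start({
    "handler": handler,
    "concurrency_modifier": concurrency_modifier,
})
```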
The bigger problem is if you run out of memory. I had instances where, for example, I set the concurrency too high, the GPU ran out of memory, and everything crashed.
Ah yeah, this is why I love the Exllamav2 quant/inference engine. The dynamic batch engine just lets you dump data in and it sorts it out for you to fit within GPU memory.
I might be better off running my own worker that utilizes exllamav2
It's not the ideal way to handle concurrency, but it'll let me be more brainless.
This sounds like a decent option
I've never tested it myself, but just looking at the way the concurrency is set up, it looks like it's basically calling the Python function up to the concurrency limit, pulling from the managed queue RunPod has.
So since it's on the same machine, I imagine the inference engine you are using would still work.
(Or, I guess, the other way is to spin up the worker with a dummy request and have it manually pull from a self-maintained queue off-platform.)
Maybe I'm thinking about this incorrectly or have a poor understanding of how GPUs distribute their VRAM over multiple processes, but from my understanding, the advantage of batch inference is that repeated tokens in the prompt can be processed once and shared across all the other prompts in the batch.
The reason I didn't want to send individual prompts in sequentially to be processed concurrently is that I don't think separate processes/functions could share that VRAM
Ah, no that makes sense. Huh. I never thought about it that way.
I believe you are correct; I was mixing up the idea of "concurrency" vs. "batching" the requests in order to do the memory management you're talking about.
I do think the best solution in your case is essentially a dummy request to spin up a worker, and then you manually pull requests from your own queue or server to do your own batching logic
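Something like this very rough sketch is what I mean; the queue URL, endpoint path, and field names are completely made up, and the generation step is a placeholder:
```python
import os

import requests
import runpod

QUEUE_URL = os.environ.get("QUEUE_URL", "https://example.com/my-queue")  # hypothetical


def handler(job):
    # The incoming RunPod job is just a trigger; the real work comes from
    # your own queue, so you control the batching.
    results = []
    while True:
        batch = requests.get(f"{QUEUE_URL}/next-batch", timeout=30).json()  # hypothetical endpoint
        if not batch.get("prompts"):
            break
        # Hand the whole batch to whatever engine is loaded on this worker
        # (exllamav2 dynamic batching, vLLM, ...) -- placeholder below.
        results.extend(f"generated for: {p[:32]}" for p in batch["prompts"])
    return {"outputs": results}


runpod.serverless.start({"handler": handler})
```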
OK and I presume that there's nothing already built into the vLLM worker to support this?
(Oof, honestly not sure.) My expertise is not in vLLM.
https://docs.vllm.ai/en/latest/
It does look like, just from searching around, that vLLM supports continuous batching
Yeah, they even have a form of dynamic batching in one of the newer versions; I'm just curious how the worker itself interacts with vLLM to utilize that functionality
But you've helped me out a great deal already
Maybe one more thing, that's not related to vLLM or AI at all
If I'm going to send in hundreds of prompts at once, I could be sending hundreds of thousands of tokens at a time
I don't think that will fit in an HTTP request right?
Haha, actually not sure 🤔. If you are managing the request yourself, realistically there is no bottleneck on RunPod's side; it's whatever you're pulling / whatever can fit in the HTTP request.
You could probably do some sort of compression, and since it's text I imagine that's very easy; it would just require your own server to do that management
Also, text HTTP requests can be quite large in my experience anyway; I've never actually hit a bottleneck myself
I think the largest thing you can send us is about 10MB total
Our vLLM quick deploy supports batching.
vLLM supports continuous batching just fine, and paired with the concurrent handler, it works on RunPod serverless too. Just be aware that the official RunPod vLLM image has a static concurrency modifier set to a fixed number; it doesn't dynamically adjust with each request based on available capacity in the vLLM engine.
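If you did want it to adjust dynamically, the modifier could look something like this sketch; the threshold check and max value are made-up examples, not what the official image ships with:
```python
MAX_CONCURRENCY = 50


def adjust_concurrency(current_concurrency: int) -> int:
    # In a real worker you'd check the vLLM engine's pending/running request
    # counts here; this boolean is just a placeholder for that capacity check.
    engine_has_headroom = True
    if engine_has_headroom and current_concurrency < MAX_CONCURRENCY:
        return current_concurrency + 1
    if not engine_has_headroom and current_concurrency > 1:
        return current_concurrency - 1
    return current_concurrency
```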
So with concurrency, does that mean I can just queue up multiple requests and the serverless worker handles it? Or do I need to send up all prompts in one batch?
👀 https://docs.runpod.io/serverless/development/concurrency
Maybe better: https://docs.runpod.io/serverless/workers/concurrent-handler
Thank you 🙂
I'm just going to take a bit of time to read through the documentation. I'm so used to running locally that wrapping my head around how the serverless workers handle this is confusing.
especially when I need to optimize throughput
Yes. vLLM supports efficient continuous batching for both. You can either send a "classic" batch of many prompts in a single request, or many requests with individual prompts. Each just has a slightly different approach and limits with RunPod.
oh excellent! Thank you. With the vLLM worker, what's the best way to send up a "classic" batch of many prompts in a single request?
Is it possible through the OpenAI API?
Yes. You can use https://api.runpod.ai/v2/your-endpoint-id/openai/v1/completions (not /v1/chat/completions - which only accepts a single chat conversation per request, so if you have conversations you should use dynamic batching with multiple requests).
Send multiple prompts as a list to the "prompt" parameter like this:
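Something along these lines; the endpoint ID, model name, and prompts are placeholders, and it assumes your API key is in the RUNPOD_API_KEY environment variable:
```python
import os

import requests

url = "https://api.runpod.ai/v2/your-endpoint-id/openai/v1/completions"
payload = {
    "model": "your-model-name",  # whatever model the worker is serving
    "prompt": [                  # a list of prompts = one "classic" batch
        "Summarize the following report: ...",
        "Translate to French: ...",
        "Classify the sentiment of: ...",
    ],
    "max_tokens": 256,
}
headers = {"Authorization": f"Bearer {os.environ['RUNPOD_API_KEY']}"}

response = requests.post(url, json=payload, headers=headers, timeout=600)
print(response.json())
```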
Or with client libraries:
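For example, the openai Python client with its base_url pointed at the endpoint; same placeholders as above:
```python
import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ["RUNPOD_API_KEY"],
    base_url="https://api.runpod.ai/v2/your-endpoint-id/openai/v1",
)

completion = client.completions.create(
    model="your-model-name",
    prompt=["First prompt ...", "Second prompt ...", "Third prompt ..."],
    max_tokens=256,
)
for choice in completion.choices:
    print(choice.index, choice.text)
```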
The output looks like this:
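Roughly this shape (values are illustrative placeholders); each prompt in the batch comes back as one entry in choices, matched to its prompt by index:
```python
# Illustrative only -- not real output.
response = {
    "id": "cmpl-...",
    "object": "text_completion",
    "model": "your-model-name",
    "choices": [
        {"index": 0, "text": "...completion for prompt 0...", "finish_reason": "stop"},
        {"index": 1, "text": "...completion for prompt 1...", "finish_reason": "stop"},
    ],
    "usage": {"prompt_tokens": 0, "completion_tokens": 0, "total_tokens": 0},
}
```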
You are my hero
Thank you
Is there a limit on the number of chars I can submit? I'm worried that my request will get truncated.
https://docs.runpod.io/serverless/endpoints/operations#asynchronous-jobs-run
10 MB for the request
neato 🙂
Yeah, for text, I'd be surprised if you hit it
And that can be sent as a request, not as a jsonl or external file payload or something?
Hey, you watch me
@shensmobile Yup! 10 MB on the whole request JSON you are sending, but I imagine it would primarily just be the body of that request (your prompts, etc.).
If it really is an issue, you could even do compression on the client side (or in an intermediary backend, if you have one), and then do decompression on the RunPod side.
I did something similar for audio / video files (compression/decompression)
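If you ever needed it, a minimal sketch of that idea for text; the field name is made up, and the worker would need the matching decompress step:
```python
import base64
import gzip
import json

prompts = ["prompt 1 ...", "prompt 2 ...", "prompt 3 ..."]

# Client side: gzip the JSON and base64 it so it travels as a plain string.
raw = json.dumps(prompts).encode("utf-8")
payload = {"compressed_prompts": base64.b64encode(gzip.compress(raw)).decode("ascii")}

# Worker side: reverse the two steps before batching.
decoded = gzip.decompress(base64.b64decode(payload["compressed_prompts"]))
assert json.loads(decoded) == prompts
```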
Or, if you really have that much data to input, you can modify the handler to download the data from an external source (S3, your own API, etc.) instead of going through the RunPod API. In that case, there is no limit as far as I know. That's a known workaround for big images in image-gen endpoints.
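As a sketch, a modified handler could look something like this; the prompts_url input field and the presigned-URL idea are assumptions for illustration:
```python
import requests
import runpod


def handler(job):
    # Fetch the real payload from an external source (e.g. a presigned S3 URL)
    # instead of embedding it in the RunPod request body.
    prompts_url = job["input"]["prompts_url"]
    prompts = requests.get(prompts_url, timeout=60).json()
    # ...run the batch through whatever engine is loaded here (placeholder)...
    return {"count": len(prompts)}


runpod.serverless.start({"handler": handler})
```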
I mean, if batch support is already built into the vLLM serverless worker, I'd rather not modify it 🙂 and 10mb should be fine for the amount of data that I'm sending. It sounds like I just need to implement and go!
Especially now that vLLM supports int8, I have a few more quantization options. Although one day I would love to figure out how to set up my own exllama v2/v3 worker as well, to have that option.