RunPod•3w ago
shensmobile

vLLM Dynamic Batching

Hi, I currently use a locally hosted exl2 setup but want to migrate my inference to RunPod serverless. My use case requires processing hundreds, sometimes thousands, of prompts at the same time. I'm currently taking advantage of exl2's dynamic batching to figure out the optimal collation for batch processing. Does the vLLM backend support taking in thousands of prompts (some of which could be close to 4096 tokens long) through the OpenAI API, processing them as a job, and returning the results as a batch?
29 Replies
J.
J.•3w ago
I'm not sure specifically about vLLM, but this sounds like a job for serverless, and you are also able to define concurrency: https://docs.runpod.io/serverless/workers/concurrent-handler So your worker is able to accept, let's say, 5 requests before spinning up more workers to accept more, and they'll keep pulling requests off the queue as they finish.
Build a concurrent handler | RunPod Documentation
Learn how to implement concurrent handlers to process multiple requests simultaneously with a single worker.
shensmobile
shensmobileOP•2w ago
In this instance, these 5 requests, would they just essentially be processed sequentially? Or would they be processed concurrently? If a worker is using a single GPU, I'm not quite sure how the requests could be processed in parallel without the backend (like vLLM) handling it
J.
J.•2w ago
Concurrently, if you are using something similar to: https://docs.runpod.io/serverless/workers/concurrent-handler Here's an example from when I played with it before:
import runpod
import asyncio


async def process_request(job):
    await asyncio.sleep(10)  # Simulate processing time
    return f"Processed: {job['input']}"  # the job payload arrives under the "input" key

def adjust_concurrency(current_concurrency):
    """
    Adjusts the concurrency level based on the current request rate.
    """
    return 5

# Start the serverless function with the handler and concurrency modifier
runpod.serverless.start({
    "handler": process_request,
    "concurrency_modifier": adjust_concurrency
})
It uses asyncio to manage the async requests.
J.
J.•2w ago
The bigger problem is if you run out of memory. I've had instances where, for example, I set the concurrency too high, the GPU ran out of memory, and everything crashed.
shensmobile
shensmobileOP•2w ago
Ah yeah, this is why I love the Exllamav2 quant/inference engine. The dynamic batch engine just lets you dump data in, and it sorts it out for you to fit within GPU memory. I might be better off running my own worker that utilizes exllamav2. It's not the ideal way to handle concurrency, but it'll let me be more brainless.
J.
J.•2w ago
This sounds like a decent option. I've never tested it myself, but just looking at the way the concurrency is set up, it looks like it's basically calling the Python function up to the concurrency limit from the managed queue RunPod has. So since it's on the same machine, I imagine the inference engine you are using would still work. (Or I guess the other way is just to spin up the worker with a dummy request and have it manually pull off a self-maintained queue off-platform.)
shensmobile
shensmobileOP•2w ago
Maybe I'm thinking about this incorrectly or have a poor understanding of how GPUs distribute their VRAM over multiple processes, but from my understanding, the advantage of batch inference is that repeated tokens in the prompt can be processed once and shared across all other prompts in the batch. The reason I didn't want to send individual prompts in sequentially to be processed concurrently is that I don't think separate processes/functions could share that VRAM.
J.
J.•2w ago
Ah, no, that makes sense. Huh, I never thought about it that way. I believe you are correct; I was mixing up the idea of "concurrency" vs. "batching" the requests in order to do the memory management you are talking about. I do think the best solution in your case is essentially just a dummy request to spin up a worker, and then you manually pull requests from your own queue or server to do your own batching logic.
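Roughly something like this, as a minimal sketch; the queue URL, its response shape, and run_batch are all placeholders for whatever you host and whatever engine you plug in (e.g. exllamav2):
import requests
import runpod

QUEUE_URL = "https://your-server.example/queue/pop"  # placeholder: your own queue endpoint

def run_batch(prompts):
    # Placeholder: hand the whole batch to your own engine (e.g. exllamav2 dynamic batching).
    return [f"Processed: {p}" for p in prompts]

def handler(job):
    # The incoming RunPod job is just a dummy trigger; the real work items
    # come from your own queue, so you keep full control of the batching.
    processed = 0
    while True:
        resp = requests.post(QUEUE_URL, json={"max_items": 64}, timeout=30)
        prompts = resp.json().get("prompts", [])
        if not prompts:
            break  # queue drained, let the worker finish
        run_batch(prompts)
        processed += len(prompts)
    return {"processed": processed}

runpod.serverless.start({"handler": handler})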
shensmobile
shensmobileOP•2w ago
OK and I presume that there's nothing already built into the vLLM worker to support this?
J.
J.•2w ago
(Oof, honestly not sure.) My expertise is not in vLLM. https://docs.vllm.ai/en/latest/ It does look like, just from searching around, that vLLM supports continuous batching.
shensmobile
shensmobileOP•2w ago
Yeah, they even have a form of dynamic batching in one of the newer versions; I'm just curious how the worker itself interacts with vLLM to utilize that functionality. But you've helped me out a great deal already. Maybe one more thing that's not related to vLLM or AI at all: if I'm going to send in hundreds of prompts at once, I could be sending hundreds of thousands of tokens at a time. I don't think that will fit in an HTTP request, right?
J.
J.•2w ago
Haha, actually not sure 🤔. If you are managing the request yourself, realistically there is no bottleneck on RunPod's side, since it's whatever you're pulling / can fit in the HTTP request. You could probably do some sort of compression, especially since it's text, which I imagine compresses very easily; it will just require your own server to do that management. Also, text HTTP requests can be quite large in my experience anyway; I've never actually hit a bottleneck myself.
Dj
Dj•2w ago
I think the largest thing you can send us is about 10MB total
yhlong00000
yhlong00000•2w ago
Our vLLM quick deploy supports batching.
3WaD
3WaD•2w ago
vLLM supports continuous batching just fine, and by pairing it with the concurrent handler, it works on RunPod serverless too. Just be aware that the official RunPod vLLM image has a static concurrency modifier set to a fixed number; it doesn't dynamically adjust it with each request based on the available concurrency in the vLLM engine.
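If you wanted it to adjust dynamically, here's a rough sketch of what a dynamic concurrency_modifier could look like; the VRAM-headroom heuristic, the thresholds, and the placeholder handler are my assumptions, not what the official image does:
import runpod
import torch

MAX_CONCURRENCY = 32      # assumed upper bound; tune for your model and GPU
MIN_FREE_FRACTION = 0.10  # assumed safety margin: keep ~10% of VRAM free

def adjust_concurrency(current_concurrency):
    """Rough heuristic: grow concurrency while VRAM headroom remains, back off otherwise."""
    free, total = torch.cuda.mem_get_info()
    if free / total > MIN_FREE_FRACTION and current_concurrency < MAX_CONCURRENCY:
        return current_concurrency + 1
    if free / total <= MIN_FREE_FRACTION and current_concurrency > 1:
        return current_concurrency - 1
    return current_concurrency

async def handler(job):
    # Placeholder: the real worker would forward job["input"] to the vLLM engine here.
    return {"echo": job["input"]}

runpod.serverless.start({
    "handler": handler,
    "concurrency_modifier": adjust_concurrency,
})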
shensmobile
shensmobileOP•2w ago
So with concurrency, does that mean I can just queue up multiple requests and the serverless worker handles it? Or do I need to send up all prompts in one batch?
Dj
Dj•2w ago
shensmobile
shensmobileOP•2w ago
Thank you 🙂 I'm just going to take a bit of time to read through the documentation. I'm so used to running locally that wrapping my head around how the serverless workers handle this is confusing, especially when I need to optimize throughput.
3WaD
3WaD•2w ago
Yes. vLLM supports efficient continuous batching for both. You can either send a "classic" batch of many prompts in a single request, or many requests with individual prompts. Each just has a slightly different approach and different limits on RunPod.
shensmobile
shensmobileOP•2w ago
Oh excellent! Thank you. With the vLLM worker, what's the best way to send up a "classic" batch of many prompts in a single request? Is it possible through the OpenAI API?
3WaD
3WaD•2w ago
Yes. You can use https://api.runpod.ai/v2/your-endpoint-id/openai/v1/completions (not /v1/chat/completions, which only accepts a single chat conversation per request; so if you have conversations, you should use dynamic batching with multiple requests). Send multiple prompts as a list in the "prompt" parameter like this:
{
    "model": "Model-Name",
    "prompt":["Prompt1", "Prompt2", "..."]
}
Or with client libraries:
completion = client.completions.create(
    model="Model-Name",
    prompt=["Prompt1", "Prompt2", "..."],
)
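For completeness, a minimal sketch of constructing that client against a RunPod endpoint; the endpoint ID and key are placeholders, and I'm assuming the key is your RunPod API key:
from openai import OpenAI

# Point the standard OpenAI client at the endpoint's OpenAI-compatible base URL.
client = OpenAI(
    base_url="https://api.runpod.ai/v2/your-endpoint-id/openai/v1",
    api_key="YOUR_RUNPOD_API_KEY",  # placeholder
)
The completions.create call above then works unchanged, and each returned choice carries an index you can match back to the position of its prompt in the list.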
The output looks like this:
{
    "choices": [
        {
            "finish_reason": "length",
            "index": 0,
            "text": "output text1"
        },
        {
            "finish_reason": "length",
            "index": 1,
            "text": "output text2"
        }
    ],
    "created": 123timestamp,
    "id": "jobid",
    "model": "Model-Name",
    "object": "text_completion",
    "usage": {
        "completion_tokens": 1,
        "prompt_tokens": 1,
        "total_tokens": 2
    }
}
shensmobile
shensmobileOP•2w ago
You are my hero. Thank you! Is there a limit on the number of characters I can submit? I'm worried that my request will get truncated.
J.
J.•2w ago
Endpoint operations | RunPod Documentation
Learn how to effectively manage RunPod Serverless jobs throughout their lifecycle, from submission to completion, using asynchronous and synchronous endpoints, status tracking, cancellation, and streaming capabilities.
shensmobile
shensmobileOP•2w ago
neato 🙂
J.
J.•2w ago
Yeah, for text, I'd be surprised if you hit it
shensmobile
shensmobileOP•2w ago
And that can be sent as a request, not as a JSONL or external file payload or something? Hey, you watch me.
J.
J.•2w ago
@shensmobile Yup! 10MB on the whole request JSON you are sending, but I imagine it would primarily just be the body of that request (your prompts, etc.). If it really is an issue, you could even do compression on the client side, or on an intermediary backend if you have one, and then do decompression on the RunPod side. I did something similar for audio/video files (compression/decompression).
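As a rough sketch of that idea, assuming gzip + base64 on the client and a matching decompress step you'd add inside your handler (the "prompts_gz" field name is made up):
import base64
import gzip
import json

def pack_prompts(prompts):
    """Client side: gzip the prompt list and base64-encode it so it fits in a JSON body."""
    raw = json.dumps(prompts).encode("utf-8")
    return base64.b64encode(gzip.compress(raw)).decode("ascii")

def unpack_prompts(blob):
    """Worker side: reverse the encoding inside your handler."""
    return json.loads(gzip.decompress(base64.b64decode(blob)).decode("utf-8"))

# Hypothetical request body for a custom handler that expects compressed prompts.
payload = {"input": {"prompts_gz": pack_prompts(["Prompt1", "Prompt2"])}}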
3WaD
3WaD•2w ago
Or, if you really have that much data to input, you can modify the handler to download the data from an external source (S3, your own API, etc.) instead of going through the RunPod API. In that case, there is no limit as far as I know. That's a known workaround for big images in image-generation endpoints.
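A minimal sketch of that workaround, assuming the request only carries a pointer to the data (the "payload_url" field and the presigned URL are placeholders):
import requests
import runpod

def handler(job):
    # The request body only contains a small pointer; the real payload lives elsewhere.
    payload_url = job["input"]["payload_url"]  # e.g. a presigned S3 URL (placeholder)
    prompts = requests.get(payload_url, timeout=60).json()
    # ... run your batched inference over `prompts` here ...
    return {"received_prompts": len(prompts)}

runpod.serverless.start({"handler": handler})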
shensmobile
shensmobileOP•2w ago
I mean, if batch support is already built into the vLLM serverless worker, I'd rather not modify it 🙂 and 10MB should be fine for the amount of data I'm sending. It sounds like I just need to implement and go! Especially now that vLLM supports int8, I have a few more quantization options, although one day I would love to figure out how to set up my own exllama v2/v3 worker as well to have that option.
