RunPod•5mo ago
JorgeG

Worker handling multiple requests concurrently

I have an application where a single worker can handle multiple requests concurrently. I can't find a way to enable this in RunPod serverless: multiple requests are always queued when using a single worker. Is this possible?
9 Replies
flash-singh
flash-singh•5mo ago
You can search here; we have answered this multiple times. Also use #🤖|ask-ai, it should be able to answer it
JorgeG
JorgeG•5mo ago
Thanks @flash-singh. I did search, but it didn't return any results. After trying different keywords, I found one post that points me towards this: https://github.com/runpod-workers/worker-vllm/blob/main/src/handler.py So I guess the magic bit is the "concurrency_modifier" arg passed to serverless start. FYI, this argument is not documented anywhere in the runpod.io docs, at least I couldn't find it.
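For anyone else who lands on this thread, here is a minimal sketch of how I read that handler (the handler body and values are my own illustration, not from official RunPod docs): "concurrency_modifier" is a callable that the SDK invokes with the worker's current concurrency and that returns how many jobs the worker should accept at once, and the handler needs to be async so jobs can actually overlap:

```python
import runpod

async def handler(job):
    # Handle one job; async so several jobs can be in flight
    # on the same worker at the same time.
    prompt = job["input"].get("prompt", "")
    return {"output": f"processed: {prompt}"}

def concurrency_modifier(current_concurrency):
    # Called by the SDK to decide how many jobs this worker
    # may run concurrently; a constant caps it at 4 here.
    return 4

runpod.serverless.start({
    "handler": handler,
    "concurrency_modifier": concurrency_modifier,
})
```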
ashleyk
ashleyk•5mo ago
Would be useful for it to be documented, I agree
flash-singh
flash-singh•5mo ago
Yes, that's it. @Justin let's document this in the serverless docs and git
rafael21@
rafael21@•5mo ago
Is it possible? One worker handling more than one request concurrently?
flash-singh
flash-singh•5mo ago
Yes, he shared the link to a worker which uses that
Justin Merrell
Justin Merrell•5mo ago
Got it on the backlog, will work with @PatrickR to get this implemented.
antoniog
antoniog•5mo ago
It also seems that concurrency_modifier doesn't work in this example. Please see this issue: https://github.com/runpod-workers/worker-vllm/issues/36
GitHub
MAX_CONCURRENCY parameter doesn't work · Issue #36 · runpod-workers/worker-vllm
Current behaviour: When sending multiple requests with a short interval (e.g. 1 second) to the endpoint with 1 worker enabled, all the requests skip the queue and are being passed to the worker. (T...
rafael21@
rafael21@•5mo ago
Justin, is this documented, please? I mean the way to have one worker handle more than one request concurrently.