RunPod•5mo ago
JorgeG

Worker handling multiple requests concurrently

I have an application where a single worker can handle multiple requests concurrently. I can't find a way to enable this in RunPod serverless: multiple requests are always queued when using a single worker. Is this possible?
9 Replies
flash-singh
flash-singh•5mo ago
You can search here; we have answered this multiple times. Also use #🤖|ask-ai, it should be able to answer it
JorgeG
JorgeG•5mo ago
Thanks @flash-singh. I did search, but it didn't return any results. After trying different keywords, I found one post that points me towards this: https://github.com/runpod-workers/worker-vllm/blob/main/src/handler.py So I guess the magic bit is the "concurrency_modifier" arg passed to serverless start. FYI, this argument is not documented anywhere in the runpod.io docs, at least I couldn't find it.
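For anyone else who lands on this thread, here is a minimal sketch of how I read that handler (the handler body and values are my own illustration, not from official RunPod docs): "concurrency_modifier" is a callable that the SDK invokes with the worker's current concurrency and that returns how many jobs the worker should accept at once, and the handler needs to be async so jobs can actually overlap:

```python
import runpod

async def handler(job):
    # Handle one job; async so several jobs can be in flight
    # on the same worker at the same time.
    prompt = job["input"].get("prompt", "")
    return {"output": f"processed: {prompt}"}

def concurrency_modifier(current_concurrency):
    # Called by the SDK to decide how many jobs this worker
    # may run concurrently; a constant caps it at 4 here.
    return 4

runpod.serverless.start({
    "handler": handler,
    "concurrency_modifier": concurrency_modifier,
})
```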
ashleyk
ashleyk•5mo ago
Would be useful for it to be documented, I agree
flash-singh
flash-singh•5mo ago
Yes, that's it. @Justin let's document this in the serverless docs and git
rafael21@
rafael21@•5mo ago
Is it possible? One worker handling more than one request concurrently?
flash-singh
flash-singh•5mo ago
Yes, he shared the link to a worker which uses that
Justin Merrell
Justin Merrell•5mo ago
Got it on the backlog, will work with @PatrickR to get this implemented.
antoniog
antoniog•5mo ago
It also seems that concurrency_modifier doesn't work in this example. Please see this issue: https://github.com/runpod-workers/worker-vllm/issues/36
GitHub
MAX_CONCURRENCY parameter doesn't work · Issue #36 · runpod-workers/worker-vllm
Current behaviour: When sending multiple requests with a short interval (e.g. 1 second) to the endpoint with 1 worker enabled, all the requests skip the queue and are being passed to the worker. (T...
rafael21@
rafael21@•5mo ago
Justin, is this documented, please? I mean the way to have one worker handle more than one request concurrently.