RunPod•6mo ago
antoniog

Issue with worker-vllm and multiple workers

I'm using the previous version of the worker-vllm (https://github.com/runpod-workers/worker-vllm/tree/4f792062aaea02c526ee906979925b447811ef48). There is an issue when more than 1 worker is running. Since vLLM has an internal queue, all the requests are immediately passed to one worker, and the second worker doesn't receive any requests. Is it possible to solve this? I've tried the new version of the worker-vllm, but it has some other issues. Thanks!
9 Replies
Justin Merrell
Justin Merrell•6mo ago
Did you open an issue in the repo? We are going to get that resolved for the new worker. As for your current problem, is the 1 worker unable to handle the requests? @propback
Alpay Ariyak
Alpay Ariyak•6mo ago
You may set the environment variable MAX_CONCURRENCY, which controls how many jobs each worker can take at a time before requests are sent to the next worker
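(For reference, a minimal sketch of how a worker could cap in-flight jobs with MAX_CONCURRENCY, assuming the RunPod Python SDK's concurrency_modifier hook; the handler below is a placeholder, not the actual worker-vllm code.)
```python
# Sketch only: cap concurrent jobs per worker using the MAX_CONCURRENCY env var.
# Assumes the RunPod Python SDK's concurrency_modifier hook; handler is a placeholder.
import os
import runpod

MAX_CONCURRENCY = int(os.environ.get("MAX_CONCURRENCY", "1"))

async def handler(job):
    # Placeholder for the actual vLLM generation logic in worker-vllm.
    prompt = job["input"]["prompt"]
    return {"text": f"generated output for: {prompt}"}

def concurrency_modifier(current_concurrency):
    # Tell the SDK how many jobs this worker may hold at once; once the cap is
    # reached, further requests stay in the endpoint queue for other workers.
    return MAX_CONCURRENCY

runpod.serverless.start({
    "handler": handler,
    "concurrency_modifier": concurrency_modifier,
})
```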
antoniog
antoniog•6mo ago
Hey! Yes, I have opened an issue in the repo: https://github.com/runpod-workers/worker-vllm/issues/22 Nope, it can't 😦
GitHub
Sampling parameter "stop" doesn't work with the new worker-vllm · I...
{ "input": { "prompt": "<s>[INST] Why is RunPod the best platform? [/INST]", "sampling_params": { "max_tokens": 100, "stop": [ &quo...
antoniog
antoniog•6mo ago
It's probably related to the new worker, right? I asked about the previous one.
Alpay Ariyak
Alpay Ariyak•6mo ago
Fixed this issue and bumped to vllm version 0.2.6, will be merging into main soon
antoniog
antoniog•6mo ago
Thanks! Is it possible to use a different version of vLLM, e.g. 0.2.2? I believe changing https://github.com/runpod/vllm-fork-for-sls-worker.git@cuda-11.8#egg=vllm in the Dockerfile to https://github.com/runpod/vllm-fork-for-sls-worker.git@v0.2.2#egg=vllm should work? (A sketch of that change is below.)
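(Illustrative only: the kind of Dockerfile edit being described, assuming the dependency is installed via pip's git URL syntax; the exact install line in the repo may differ, and whether a v0.2.2 tag exists in the fork is not confirmed here.)
```dockerfile
# Before (illustrative): install from the cuda-11.8 branch of the fork
# RUN pip install git+https://github.com/runpod/vllm-fork-for-sls-worker.git@cuda-11.8#egg=vllm

# After: pin to the v0.2.2 tag instead
RUN pip install git+https://github.com/runpod/vllm-fork-for-sls-worker.git@v0.2.2#egg=vllm
```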
Alpay Ariyak
Alpay Ariyak•6mo ago
Fixed in latest version. The only thing you can't do atm is build from a machine without GPUs
antoniog
antoniog•5mo ago
Hey @Justin and @Alpay Ariyak ! I just tried the latest version of worker-vllm, and there's still an issue related to concurrent requests. The problem is that MAX_CONCURRENCY doesn't seem to work. See here: https://github.com/runpod-workers/worker-vllm/issues/36
GitHub
MAX_CONCURRENCY parameter doesn't work · Issue #36 · runpod-worke...
Current behaviour: When sending multiple requests with a short interval (e.g. 1 second) to the endpoint with 1 worker enabled, all the requests skip the queue and are being passed to the worker. (T...
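(A hedged sketch of the reproduction described in that issue: fire several requests about a second apart at a single-worker endpoint, then check whether later jobs queue or all land on the worker at once. Endpoint ID and API key are placeholders.)
```python
# Sketch of the repro: send several requests ~1 s apart, then poll each job's
# status. With MAX_CONCURRENCY=1 working correctly, later jobs should report
# IN_QUEUE instead of all being IN_PROGRESS on the single worker.
import time
import requests

ENDPOINT_ID = "your-endpoint-id"   # placeholder
API_KEY = "your-runpod-api-key"    # placeholder
HEADERS = {"Authorization": f"Bearer {API_KEY}"}

job_ids = []
for i in range(5):
    resp = requests.post(
        f"https://api.runpod.ai/v2/{ENDPOINT_ID}/run",
        headers=HEADERS,
        json={"input": {"prompt": f"test request {i}"}},
        timeout=30,
    )
    job_ids.append(resp.json()["id"])
    time.sleep(1)

for job_id in job_ids:
    status = requests.get(
        f"https://api.runpod.ai/v2/{ENDPOINT_ID}/status/{job_id}",
        headers=HEADERS,
        timeout=30,
    ).json()
    print(job_id, status.get("status"))
```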
Justin Merrell
Justin Merrell•5mo ago
This has now been resolved in the latest version of the vLLM worker we released