RunPod•6mo ago
antoniog

Issue with worker-vllm and multiple workers

I'm using the previous version of the worker-vllm (https://github.com/runpod-workers/worker-vllm/tree/4f792062aaea02c526ee906979925b447811ef48). There is an issue when more than 1 worker is running. Since vLLM has an internal queue, all the requests are immediately passed to one worker, and the second worker doesn't receive any requests. Is it possible to solve this? I've tried the new version of the worker-vllm, but it has some other issues. Thanks!
9 Replies
Justin Merrell
Justin Merrell•6mo ago
Did you open an issue in the repo? We are going to get that resolved for the new worker. As for your current problem, is the 1 worker unable to handle the requests? @propback
Alpay Ariyak
Alpay Ariyak•6mo ago
You may set the environment variable MAX_CONCURRENCY, which controls how many jobs each worker can take at a time before requests are sent to the next worker
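(For reference, a minimal sketch of how a worker could cap in-flight jobs with MAX_CONCURRENCY, assuming the RunPod Python SDK's concurrency_modifier hook; the handler below is a placeholder, not the actual worker-vllm code.)
```python
# Sketch only: cap concurrent jobs per worker using the MAX_CONCURRENCY env var.
# Assumes the RunPod Python SDK's concurrency_modifier hook; handler is a placeholder.
import os
import runpod

MAX_CONCURRENCY = int(os.environ.get("MAX_CONCURRENCY", "1"))

async def handler(job):
    # Placeholder for the actual vLLM generation logic in worker-vllm.
    prompt = job["input"]["prompt"]
    return {"text": f"generated output for: {prompt}"}

def concurrency_modifier(current_concurrency):
    # Tell the SDK how many jobs this worker may hold at once; once the cap is
    # reached, further requests stay in the endpoint queue for other workers.
    return MAX_CONCURRENCY

runpod.serverless.start({
    "handler": handler,
    "concurrency_modifier": concurrency_modifier,
})
```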
antoniog
antoniog•6mo ago
Hey! Yes, I have opened an issue in the repo: https://github.com/runpod-workers/worker-vllm/issues/22 Nope, it can't 😦
GitHub
Sampling parameter "stop" doesn't work with the new worker-vllm · I...
{ "input": { "prompt": "<s>[INST] Why is RunPod the best platform? [/INST]", "sampling_params": { "max_tokens": 100, "stop": [ &quo...
antoniog
antoniog•6mo ago
It's probably related to the new worker, right? I asked about the previous one.
Alpay Ariyak
Alpay Ariyak•6mo ago
Fixed this issue and bumped to vllm version 0.2.6, will be merging into main soon
antoniog
antoniog•6mo ago
Thanks! Is it possible to use a different version of vLLM, e.g. 0.2.2? I believe changing https://github.com/runpod/vllm-fork-for-sls-worker.git@cuda-11.8#egg=vllm in the Dockerfile to https://github.com/runpod/vllm-fork-for-sls-worker.git@v0.2.2#egg=vllm should work? (A sketch of that change is below.)
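(Illustrative only: the kind of Dockerfile edit being described, assuming the dependency is installed via pip's git URL syntax; the exact install line in the repo may differ, and whether a v0.2.2 tag exists in the fork is not confirmed here.)
```dockerfile
# Before (illustrative): install from the cuda-11.8 branch of the fork
# RUN pip install git+https://github.com/runpod/vllm-fork-for-sls-worker.git@cuda-11.8#egg=vllm

# After: pin to the v0.2.2 tag instead
RUN pip install git+https://github.com/runpod/vllm-fork-for-sls-worker.git@v0.2.2#egg=vllm
```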
Alpay Ariyak
Alpay Ariyak•6mo ago
Fixed in latest version. The only thing you can't do atm is build from a machine without GPUs
antoniog
antoniog•5mo ago
Hey @Justin and @Alpay Ariyak ! I just tried the latest version of worker-vllm, and there's still an issue related to concurrent requests. The problem is that MAX_CONCURRENCY doesn't seem to work. See here: https://github.com/runpod-workers/worker-vllm/issues/36
GitHub
MAX_CONCURRENCY parameter doesn't work · Issue #36 · runpod-worke...
Current behaviour: When sending multiple requests with a short interval (e.g. 1 second) to the endpoint with 1 worker enabled, all the requests skip the queue and are being passed to the worker. (T...
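(A hedged sketch of the reproduction described in that issue: fire several requests about a second apart at a single-worker endpoint, then check whether later jobs queue or all land on the worker at once. Endpoint ID and API key are placeholders.)
```python
# Sketch of the repro: send several requests ~1 s apart, then poll each job's
# status. With MAX_CONCURRENCY=1 working correctly, later jobs should report
# IN_QUEUE instead of all being IN_PROGRESS on the single worker.
import time
import requests

ENDPOINT_ID = "your-endpoint-id"   # placeholder
API_KEY = "your-runpod-api-key"    # placeholder
HEADERS = {"Authorization": f"Bearer {API_KEY}"}

job_ids = []
for i in range(5):
    resp = requests.post(
        f"https://api.runpod.ai/v2/{ENDPOINT_ID}/run",
        headers=HEADERS,
        json={"input": {"prompt": f"test request {i}"}},
        timeout=30,
    )
    job_ids.append(resp.json()["id"])
    time.sleep(1)

for job_id in job_ids:
    status = requests.get(
        f"https://api.runpod.ai/v2/{ENDPOINT_ID}/status/{job_id}",
        headers=HEADERS,
        timeout=30,
    ).json()
    print(job_id, status.get("status"))
```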
Justin Merrell
Justin Merrell•5mo ago
This has now been resolved in the latest version of the vLLM worker we released