Serverless VLLM concurrency issue

Hello everyone, I deployed a serverless vLLM endpoint (Gemma 12B model) through the RunPod UI, with 2 workers on A100 80GB VRAM. If I send two requests at the same time, they both become IN PROGRESS, but I receive the output stream of one first; the second always waits for the first to finish before I start receiving its token stream. Why is it behaving like this?
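For reference, a minimal sketch of a client-side reproduction: two streaming requests fired concurrently against the endpoint. This assumes the worker is reached through an OpenAI-compatible route; the endpoint URL, API key, and model id below are placeholders, not values from this thread.

import asyncio
from openai import AsyncOpenAI

# Placeholders -- substitute your own endpoint id, API key, and model id.
client = AsyncOpenAI(
    base_url="https://api.runpod.ai/v2/<endpoint_id>/openai/v1",
    api_key="<runpod_api_key>",
)
MODEL = "google/gemma-3-12b-it"  # assumed model id

async def stream_one(tag: str) -> None:
    # Open a streaming chat completion and print each delta with a tag,
    # so interleaving (or the lack of it) between the two requests is visible.
    stream = await client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": "Write a short story."}],
        stream=True,
        max_tokens=200,
    )
    async for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            print(f"[{tag}] {chunk.choices[0].delta.content}", flush=True)

async def main() -> None:
    # If the worker truly serves both requests at once, output tagged [A] and [B]
    # should appear interleaved instead of strictly A-then-B.
    await asyncio.gather(stream_one("A"), stream_one("B"))

asyncio.run(main())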
47 Replies
Unknown User · 7mo ago · (message not public)
Abdelrhman Nile (OP) · 7mo ago
@Jason It is the same worker. Is there any way I can make it respond to both of them at the same time?
riverfog7 · 7mo ago
You can't without changing code, because of the nature of LLMs: even if you give them the same input, the output may vary in length, which affects generation time.
Unknown User · 7mo ago · (message not public)
Abdelrhman Nile (OP) · 7mo ago
I am using an A100 with 80GB VRAM and it is supposed to be very fast! Before, I used to deploy the same model on an A100 40GB on GCP with vLLM and it had no problem handling concurrent requests. DEFAULT_BATCH_SIZE or BATCH_SIZE?
Unknown User · 7mo ago · (message not public)
Abdelrhman Nile (OP) · 7mo ago
yes same everything
Unknown User · 7mo ago · (message not public)
Abdelrhman Nile (OP) · 7mo ago
My issue is not really the speed; the speed is decent when there is no cold start. My issue is handling more than one request at the same time.
Unknown User · 7mo ago · (message not public)
Abdelrhman Nile (OP) · 7mo ago
Yes, the first request starts streaming, and a second request from another client always starts after the first one finishes.
riverfog7 · 7mo ago
with two workers?
Abdelrhman Nile (OP) · 7mo ago
I'll do some benchmarks and provide you with the numbers. 2 and 3 workers, tried both.
riverfog7 · 7mo ago
Can you check the vLLM logs? They should report metrics like currently running requests, waiting requests, etc., and tok/s.
Unknown User · 7mo ago · (message not public)
riverfog7 · 7mo ago
do we need to set batch size with vllm workers?
Unknown User · 7mo ago · (message not public)
riverfog7 · 7mo ago
vLLM intelligently does batching until its KV cache is full.
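For context, the limits that bound that batching live in the engine configuration (the KV-cache size via gpu_memory_utilization, the max_num_seqs cap, and max_model_len), not in a client-side batch size. A rough sketch of setting them when constructing a vLLM engine directly; the model id and values here are assumptions for illustration:

from vllm import LLM, SamplingParams

# Engine-side knobs that bound continuous batching (values are illustrative).
llm = LLM(
    model="google/gemma-3-12b-it",   # assumed model id
    max_num_seqs=64,                 # upper bound on sequences batched per scheduler step
    gpu_memory_utilization=0.95,     # fraction of VRAM for weights + KV cache
    max_model_len=8192,              # shorter context leaves room for more concurrent sequences
)

prompts = [f"Question {i}: explain the KV cache in one sentence." for i in range(16)]
# One generate() call over many prompts is scheduled as a continuously batched
# workload; vLLM admits as many sequences per step as the KV cache allows.
outputs = llm.generate(prompts, SamplingParams(max_tokens=64))
for out in outputs:
    print(out.outputs[0].text[:80])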
Unknown User · 7mo ago · (message not public)
Abdelrhman Nile (OP) · 7mo ago
no i mean i was configuring the endpoint to scale up to multiple workers if needed
Abdelrhman Nile (OP) · 7mo ago
Logs when sending 2 requests:
(two screenshots of the worker logs attached; contents not readable here)
Right now it is configured to only have one worker.
riverfog7 · 7mo ago
Try setting the default batch size to 10.
Abdelrhman Nile (OP) · 7mo ago
I am setting the default batch size to 1 because I noticed streaming used to send very big chunks of tokens.
Unknown User · 7mo ago · (message not public)
riverfog7 · 7mo ago
lol
Unknown User · 7mo ago · (message not public)
Abdelrhman Nile (OP) · 7mo ago
i tried it with 50 and 256
riverfog7 · 7mo ago
That setting means only 1 request should be processed concurrently.
Unknown User · 7mo ago · (message not public)
Abdelrhman Nile (OP) · 7mo ago
same behavior of not handling multiple requests with default batch size set to 50 and 256
Unknown User · 7mo ago · (message not public)
riverfog7 · 7mo ago
No no, sorry for the misinformation.
Abdelrhman Nile (OP) · 7mo ago
But both requests' statuses appear as IN PROGRESS.
riverfog7 · 7mo ago
It's the batch size for streaming tokens.
This is the real one, but you didn't set it, so it should be fine:
(screenshot of the setting attached)
Abdelrhman Nile (OP) · 7mo ago
are you sure? i tried it with 5, 10, 50, 256 and i got the same behaviour but let me try it one more time to confirm
riverfog7 · 7mo ago
uhh i mean it doesnt matter if you set it to 5 / 10 / etc
Unknown User · 7mo ago · (message not public)
riverfog7 · 7mo ago
Because it is related to token streaming, not the actual requests. @Abdelrhman Nile, can you maybe try spamming requests? Like 50+?
Unknown User · 7mo ago · (message not public)
riverfog7 · 7mo ago
set the max workers to 1 and then spam requests
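A sketch of what that spam test could look like from a script, using the same assumed OpenAI-compatible route and placeholder names as above: fire N streaming requests at once and log each one's time to first token. If the single worker is really batching them, the first tokens should arrive close together rather than one request at a time.

import asyncio
import time
from openai import AsyncOpenAI

# Placeholders -- same assumptions as the earlier sketch.
client = AsyncOpenAI(
    base_url="https://api.runpod.ai/v2/<endpoint_id>/openai/v1",
    api_key="<runpod_api_key>",
)

async def one_request(i: int, t0: float) -> None:
    stream = await client.chat.completions.create(
        model="google/gemma-3-12b-it",  # assumed model id
        messages=[{"role": "user", "content": f"Write haiku number {i}."}],
        stream=True,
        max_tokens=64,
    )
    async for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            # Record the time-to-first-token for this request and stop reading.
            print(f"req {i:02d}: first token after {time.perf_counter() - t0:6.2f}s")
            break

async def main(n: int = 50) -> None:
    t0 = time.perf_counter()
    await asyncio.gather(*(one_request(i, t0) for i in range(n)))

asyncio.run(main())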
Abdelrhman Nile (OP) · 7mo ago
I kinda did that with the vLLM benchmark serving script, let me share the results with you:
============ Serving Benchmark Result ============
Successful requests: 857
Benchmark duration (s): 95.82
Total input tokens: 877568
Total generated tokens: 68965
Request throughput (req/s): 8.94
Output token throughput (tok/s): 719.70
Total Token throughput (tok/s): 9877.74
---------------Time to First Token----------------
Mean TTFT (ms): 42451.61
Median TTFT (ms): 42317.61
P99 TTFT (ms): 77811.55
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 472.19
Median TPOT (ms): 190.87
P99 TPOT (ms): 3881.05
---------------Inter-token Latency----------------
Mean ITL (ms): 182.12
Median ITL (ms): 0.01
P99 ITL (ms): 4703.27
==================================================
Configuration was max workers = 3, and I was NOT setting the default batch size; it was left on the default, which I believe is 50. Also, the script sent 1000 requests and only 857 were successful.
Same model, same benchmark, but on a GCP A100 40GB VRAM machine:
============ Serving Benchmark Result ============
Successful requests: 1000
Benchmark duration (s): 346.74
Total input tokens: 1024000
Total generated tokens: 70328
Request throughput (req/s): 2.88
Output token throughput (tok/s): 202.83
Total Token throughput (tok/s): 3156.09
---------------Time to First Token----------------
Mean TTFT (ms): 172033.53
Median TTFT (ms): 178518.65
P99 TTFT (ms): 326714.81
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 357.45
Median TPOT (ms): 271.08
P99 TPOT (ms): 1728.97
---------------Inter-token Latency----------------
Mean ITL (ms): 263.52
Median ITL (ms): 151.98
P99 ITL (ms): 1228.35
==================================================
will test that
3WaD · 7mo ago
When you initialize the vLLM engine (on cold start), you should see a log similar to this as part of vLLM's memory profiling: "Maximum concurrency for 32768 tokens per request: 5.42x". Make sure that the engine can reach concurrency > 2. That being said, the official RunPod vLLM image unfortunately does not handle concurrency dynamically (it's hardcoded to 300 or a static value), which will result in bottlenecking the jobs anyway. But it's definitely possible to stream multiple responses concurrently from a single serverless worker. Or at least it's working on my implementation.
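For anyone building a custom worker instead: the RunPod Python SDK does allow a single serverless worker to hold several jobs at once via an async handler plus a concurrency modifier. A rough sketch of that wiring (the handler body is a stub, the vLLM side is elided, and the concurrency limit is an assumption; this is not the official worker's code):

import runpod

MAX_CONCURRENCY = 8  # assumed cap; in practice derive it from the engine's KV-cache headroom

def concurrency_modifier(current_concurrency: int) -> int:
    # Called by the SDK to decide how many jobs this worker may run in parallel.
    return MAX_CONCURRENCY

async def handler(job):
    prompt = job["input"]["prompt"]
    # ... submit `prompt` to an async vLLM engine here and iterate its stream ...
    for token in ("streamed ", "tokens ", "would ", "go ", "here"):  # stub output
        yield token

runpod.serverless.start(
    {
        "handler": handler,
        "concurrency_modifier": concurrency_modifier,
        "return_aggregate_stream": True,  # also expose the aggregated result on /run
    }
)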
riverfog7 · 7mo ago
Actually, that doesn't matter; you can batch even if you have less than 2x concurrency, as long as the requests fit in the KV cache. Anyway, he has enough cache (the requests don't even use 5 percent of it). I don't know why it doesn't work either, because everything is right.
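To put rough numbers on that, using the example log line above (5.42x for 32768 tokens) and the per-request sizes from the RunPod benchmark:

# Back-of-the-envelope: how many benchmark-sized requests fit in the KV cache.
# The 5.42x / 32768 figures are from the example log line quoted above; the
# per-request sizes are from the RunPod benchmark output earlier in the thread.
max_model_len = 32768
concurrency_at_full_len = 5.42
kv_cache_tokens = concurrency_at_full_len * max_model_len   # ~178k tokens

avg_input = 877568 / 857      # ~1024 input tokens per request
avg_output = 68965 / 857      # ~80 generated tokens per request
avg_request = avg_input + avg_output

print(f"KV cache capacity   : ~{kv_cache_tokens:,.0f} tokens")
print(f"Average request size: ~{avg_request:,.0f} tokens")
print(f"Cache share/request : {avg_request / kv_cache_tokens:.2%}")  # well under 1%
print(f"Requests that fit   : ~{kv_cache_tokens / avg_request:,.0f}")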
