Distributed inference with Llama 3.2 3B on 8 GPUs with tensor parallelism + Disaggregated serving

Hi. I need help setting up a vLLM serverless pod with disaggregated serving and distributed inference for a Llama 3.2 3B model. The setup would be disaggregated: one worker with 8 GPUs in total, where 4 GPUs serve one prefill task and 4 GPUs serve one decode task. Can experts help me set this up using vLLM on RunPod Serverless? I'm going for this approach because I want very low latency, and I think sharding the model separately for prefill and decode with tensor parallelism will help me achieve that. Additionally, I want the prefill batch size to be 1 and the decode batch size to be 16.
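For reference, here is a minimal sketch of what a 1-prefill/1-decode (1P1D) split could look like using vLLM's experimental disaggregated-prefill support (the `--kv-transfer-config` flag with the PyNcclConnector). The model ID, ports, and GPU assignments are assumptions for illustration, not a tested RunPod serverless config:

```python
import os
import subprocess

MODEL = "meta-llama/Llama-3.2-3B-Instruct"  # assumed model id


def launch(gpus: str, port: int, role: str, rank: int, max_num_seqs: int) -> subprocess.Popen:
    """Start one vllm server pinned to a subset of GPUs (experimental disaggregated-prefill flags)."""
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=gpus)
    kv_cfg = (
        '{"kv_connector":"PyNcclConnector",'
        f'"kv_role":"{role}","kv_rank":{rank},"kv_parallel_size":2}}'
    )
    cmd = [
        "vllm", "serve", MODEL,
        "--port", str(port),
        "--tensor-parallel-size", "4",       # shard across the 4 visible GPUs
        "--max-num-seqs", str(max_num_seqs), # caps concurrent sequences per instance
        "--kv-transfer-config", kv_cfg,      # experimental: ship KV caches between instances
    ]
    return subprocess.Popen(cmd, env=env)


# Prefill instance: GPUs 0-3, at most 1 sequence at a time, produces KV caches.
prefill = launch("0,1,2,3", 8100, "kv_producer", 0, 1)
# Decode instance: GPUs 4-7, up to 16 concurrent sequences, consumes KV caches.
decode = launch("4,5,6,7", 8200, "kv_consumer", 1, 16)

prefill.wait()
decode.wait()
```

Note that you still need a proxy in front of the two instances that sends each request to the prefill server first and then to the decode server; the vLLM repository ships an example proxy alongside its disaggregated-prefill examples. Also, `--max-num-seqs` is a cap on concurrent sequences rather than a fixed batch size, which is the closest approximation to "prefill batch 1 / decode batch 16".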
yhlong00000 · 9mo ago
I haven’t tried this setup before, but given that the model is relatively small, using multiple GPUs might not be beneficial. If the GPUs you’re using aren’t connected via NVLink, the communication overhead between them could actually make it slower than running everything on a single GPU.
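If you want to benchmark that claim, the single-GPU baseline to compare against is just a plain launch with no tensor parallelism and no KV transfer; the model name and flags below are illustrative:

```python
import os
import subprocess

# Baseline: one GPU, no sharding, same decode concurrency as the proposed setup.
env = dict(os.environ, CUDA_VISIBLE_DEVICES="0")
subprocess.run(
    [
        "vllm", "serve", "meta-llama/Llama-3.2-3B-Instruct",
        "--port", "8000",
        "--max-num-seqs", "16",
    ],
    env=env,
)
```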
cellular-automaton (OP) · 9mo ago
A high-throughput use case can be served with interleaved decoding on a single GPU. However, I'm interested in a low-latency setup. Agreed on the NVLink part. Do you have any guidance on how to set that up on RunPod?
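Not RunPod-specific, but one way to verify whether the GPUs in a pod are actually NVLink-connected before committing to the sharded layout is to check the interconnect topology from inside the pod (standard nvidia-smi, shown here wrapped in Python):

```python
import subprocess

# Print the GPU interconnect matrix; links shown as "NV#" indicate NVLink,
# while "PHB"/"SYS" mean traffic crosses PCIe/the host bridge and adds latency.
print(subprocess.run(
    ["nvidia-smi", "topo", "-m"],
    capture_output=True, text=True, check=True,
).stdout)
```

As a rule of thumb, SXM form-factor GPUs are typically the NVLink-connected ones, while the PCIe variants are not, so the GPU type you pick for the pod matters as much as the pod configuration.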