Distributed inference with Llama 3.2 3B on 8 GPUs with tensor parallelism + Disaggregated serving

Hi. I need help setting up a vLLM serverless pod with disaggregated serving and distributed inference for a Llama 3.2 3B model. The setup would be disaggregated: one worker with 8 GPUs in total, where 4 GPUs serve one prefill task and 4 GPUs serve one decode task. Can experts help me set this up using vLLM on RunPod Serverless? I'm going for this approach because I want very low latency, and I think sharding the model separately for prefill and decode with tensor parallelism will help me achieve that. Additionally, I want the prefill batch size to be 1 and the decode batch size to be 16.
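For reference, here is a minimal sketch of what a 1-prefill/1-decode (1P1D) split could look like using vLLM's experimental disaggregated-prefill support (the `--kv-transfer-config` flag with the PyNcclConnector). The model ID, ports, and GPU assignments are assumptions for illustration, not a tested RunPod serverless config:

```python
import os
import subprocess

MODEL = "meta-llama/Llama-3.2-3B-Instruct"  # assumed model id


def launch(gpus: str, port: int, role: str, rank: int, max_num_seqs: int) -> subprocess.Popen:
    """Start one vllm server pinned to a subset of GPUs (experimental disaggregated-prefill flags)."""
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=gpus)
    kv_cfg = (
        '{"kv_connector":"PyNcclConnector",'
        f'"kv_role":"{role}","kv_rank":{rank},"kv_parallel_size":2}}'
    )
    cmd = [
        "vllm", "serve", MODEL,
        "--port", str(port),
        "--tensor-parallel-size", "4",       # shard across the 4 visible GPUs
        "--max-num-seqs", str(max_num_seqs), # caps concurrent sequences per instance
        "--kv-transfer-config", kv_cfg,      # experimental: ship KV caches between instances
    ]
    return subprocess.Popen(cmd, env=env)


# Prefill instance: GPUs 0-3, at most 1 sequence at a time, produces KV caches.
prefill = launch("0,1,2,3", 8100, "kv_producer", 0, 1)
# Decode instance: GPUs 4-7, up to 16 concurrent sequences, consumes KV caches.
decode = launch("4,5,6,7", 8200, "kv_consumer", 1, 16)

prefill.wait()
decode.wait()
```

Note that you still need a proxy in front of the two instances that sends each request to the prefill server first and then to the decode server; the vLLM repository ships an example proxy alongside its disaggregated-prefill examples. Also, `--max-num-seqs` is a cap on concurrent sequences rather than a fixed batch size, which is the closest approximation to "prefill batch 1 / decode batch 16".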
yhlong00000 · 9mo ago
I haven’t tried this setup before, but given that the model is relatively small, using multiple GPUs might not be beneficial. If the GPUs you’re using aren’t connected via NVLink, the communication overhead between them could actually make it slower than running everything on a single GPU.
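If you want to benchmark that claim, the single-GPU baseline to compare against is just a plain launch with no tensor parallelism and no KV transfer; the model name and flags below are illustrative:

```python
import os
import subprocess

# Baseline: one GPU, no sharding, same decode concurrency as the proposed setup.
env = dict(os.environ, CUDA_VISIBLE_DEVICES="0")
subprocess.run(
    [
        "vllm", "serve", "meta-llama/Llama-3.2-3B-Instruct",
        "--port", "8000",
        "--max-num-seqs", "16",
    ],
    env=env,
)
```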
cellular-automaton (OP) · 9mo ago
A high-throughput use case can be served with interleaved decoding on a single GPU. However, I'm interested in a low-latency setup. Agreed on the NVLink part. Do you have any guidance on how to set that up on RunPod?
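Not RunPod-specific, but one way to verify whether the GPUs in a pod are actually NVLink-connected before committing to the sharded layout is to check the interconnect topology from inside the pod (standard nvidia-smi, shown here wrapped in Python):

```python
import subprocess

# Print the GPU interconnect matrix; links shown as "NV#" indicate NVLink,
# while "PHB"/"SYS" mean traffic crosses PCIe/the host bridge and adds latency.
print(subprocess.run(
    ["nvidia-smi", "topo", "-m"],
    capture_output=True, text=True, check=True,
).stdout)
```

As a rule of thumb, SXM form-factor GPUs are typically the NVLink-connected ones, while the PCIe variants are not, so the GPU type you pick for the pod matters as much as the pod configuration.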