RunPod4mo ago

Multi GPU

I was running an experiment with LoRAX (https://github.com/predibase/lorax) on multiple GPUs. However, I did not observe any improvement; in fact, the throughput was even worse. For sequential calls, the throughput on 1x GPU is better than on 2x GPU!

Code for sequential calls:

    import requests
    from time import time

    def tgi_server(prompt):
        headers = {'Content-Type': 'application/json'}
        url = f'.../generate'
        data = {
            "inputs": prompt,
            "parameters": {
                "max_new_tokens": 1000,
                "temperature": 1.0,
                "top_p": 0.99,
                "do_sample": False,
                "seed": 42
            }
        }
        response = requests.post(url, json=data, headers=headers)
        # print(response.status_code)
        res = response.json()
        # print(res)
        return res

    if __name__ == '__main__':
        # input_sample_data is the list of input samples (not shown in this post)
        for index, sample in enumerate(input_sample_data):
            input_text = '...'
            input_str = f'"""{input_text}"""'
            template = f"""[INST] <<SYS>> ... <</SYS>> {input_str}[/INST]"""
            print("starting on {}".format(index))
            s0 = time()
            # print(template)
            response = tgi_server(template)
            s1 = time()
            # print(response)
            response = response["generated_text"]

I asked the LoRAX team this question, and they replied:
This isn't surprising if your GPUs are connected via PCIe. Unless you're using NVLink, the network overhead of GPU-to-GPU communication will, in most cases, be the bottleneck for inference. The main situations where you would want to use multi-GPU would be:
- When the model is too large to fit on a single GPU
- When your GPUs are connected by NVLink
If neither condition is met, you're definitely better off on a single GPU.
I am using 2x L40 on RunPod.
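For anyone reproducing the 1x vs 2x comparison: the loop in the post records a start and end time per request but does not turn them into a throughput number. Below is a minimal sketch of one way to do that for sequential calls, assuming you can obtain the number of generated tokens for each response (for example by tokenizing `generated_text` yourself). The names `RequestTiming` and `tokens_per_second` are hypothetical helpers, not part of LoRAX, and the numbers in the usage example are made up for illustration.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class RequestTiming:
    """One sequential request: tokens generated and wall-clock seconds (s1 - s0)."""
    generated_tokens: int
    seconds: float


def tokens_per_second(timings: Iterable[RequestTiming]) -> float:
    """Aggregate throughput for requests issued one after another.

    Because the calls are sequential, total wall time is the sum of the
    per-request latencies.
    """
    timings = list(timings)
    total_tokens = sum(t.generated_tokens for t in timings)
    total_seconds = sum(t.seconds for t in timings)
    if total_seconds <= 0:
        raise ValueError("total elapsed time must be positive")
    return total_tokens / total_seconds


# Illustrative (made-up) measurements, not results from the thread.
example = [RequestTiming(100, 2.0), RequestTiming(300, 2.0)]
example_throughput = tokens_per_second(example)
```

Comparing this figure between the 1x and 2x GPU deployments, with the same prompts and the same `max_new_tokens`, gives a like-for-like number rather than an impression from individual request times.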
8 Replies
ashleyk4mo ago
You need to ask this question on the GitHub repo, not here.
Siamak4mo ago
@ashleyk, my question for RunPod is whether your GPUs are connected by NVLink or PCIe. Is there a RunPod GitHub repo where I can ask this?
ashleyk4mo ago
Ah yeah, it would have been better just to ask that. @flash-singh can probably answer this.
Siamak4mo ago
@flash-singh Could you please help me?
ashleyk4mo ago
He is in the US, so you'll probably have to wait a few hours for him to come online.
flash-singh4mo ago
L40s are PCIe only.
Siamak4mo ago
@flash-singh Are the RTX 4090s connected via NVLink? I have the same issue on RTX 4090s as well. Could you please tell me which GPU types are connected via NVLink?
flash-singh4mo ago
NVLink or other fast interconnects are only on the SXM GPUs: A100 or H100.
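To check the interconnect on a given pod yourself, `nvidia-smi topo -m` prints a matrix of how each GPU pair is connected. The sketch below is one possible way to read that matrix from Python, assuming `nvidia-smi` is installed on the machine. `classify_link` and `gpu_topology_matrix` are hypothetical helper names; the codes they recognize (`NV#`, `PIX`, `PXB`, `PHB`, `NODE`, `SYS`, `X`) come from the legend that `nvidia-smi topo -m` prints.

```python
import subprocess

# Legend codes from `nvidia-smi topo -m` that mean the path goes over PCIe
# (SYS and NODE additionally cross CPU sockets or NUMA nodes).
PCIE_CODES = {"PIX", "PXB", "PHB", "NODE", "SYS"}


def classify_link(code: str) -> str:
    """Map one cell of the topology matrix to 'nvlink', 'pcie', 'self' or 'unknown'."""
    code = code.strip().upper()
    if code == "X":
        return "self"  # a GPU compared with itself
    if code.startswith("NV") and code[2:].isdigit():
        return "nvlink"  # NV# means a bonded set of # NVLinks
    if code in PCIE_CODES:
        return "pcie"
    return "unknown"


def gpu_topology_matrix() -> str:
    """Return the raw output of `nvidia-smi topo -m` (requires an NVIDIA driver)."""
    result = subprocess.run(
        ["nvidia-smi", "topo", "-m"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

On a GPU machine, `print(gpu_topology_matrix())` shows the matrix, and passing the GPU0/GPU1 cell to `classify_link` tells you whether that pair is linked by NVLink or only by PCIe.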