Siamak · Runpod · 2y ago

Multi GPU

I was conducting an experiment to run LoRAX (https://github.com/predibase/lorax) on multiple GPUs. However, I did not observe any improvement in the results; in fact, the throughput was even worse.


For sequential calls, the throughput on 1x GPU is better than on 2x GPU!


Code for sequential calls:

import requests
from time import time

def tgi_server(prompt):
    headers = {'Content-Type': 'application/json'}
    url = f'.../generate'  # endpoint URL elided
    data = {
        "inputs": prompt,
        "parameters": {
            "max_new_tokens": 1000,
            "temperature": 1.0,
            "top_p": 0.99,
            "do_sample": False,
            "seed": 42
        }
    }
    response = requests.post(url, json=data, headers=headers)
    res = response.json()
    return res

if __name__ == '__main__':
    # input_sample_data and InsightSourceId are defined elsewhere in the original script
    for index, sample in enumerate(input_sample_data):
        input_text = '...'
        input_str = f'"""{input_text}"""'
        template = f"""[INST] <<SYS>> ... <</SYS>> {input_str}[/INST]"""
        print("starting on {}".format(InsightSourceId))
        s0 = time()
        response = tgi_server(template)
        s1 = time()
        response = response["generated_text"]
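
For comparison, here is a minimal sketch of firing the same requests concurrently instead of one at a time, so the server has several requests in flight to batch; it assumes the tgi_server helper above and a hypothetical prompts list of pre-built templates:

from concurrent.futures import ThreadPoolExecutor
from time import time

prompts = ["..."]  # hypothetical list of pre-built [INST] ... [/INST] templates

start = time()
# Keep several requests in flight at once so the server can batch them
with ThreadPoolExecutor(max_workers=8) as pool:
    results = [r["generated_text"] for r in pool.map(tgi_server, prompts)]
elapsed = time() - start
print(f"{len(results)} requests in {elapsed:.1f}s ({len(results) / elapsed:.2f} req/s)")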

I asked the LoRAX team about this, and they mentioned:


This isn't surprising if your GPUs are connected via PCIe. Unless you're using NVLink, the network overhead of GPU-to-GPU communication will, in most cases, be the bottleneck for inference.

The main situations where you would want to use multi-GPU would be:

When the model is too large to fit on a single GPU
When your GPUs are connected by NVLink

If neither condition is met, you're definitely better off on a single GPU.


I am using 2x L40 GPUs on Runpod.
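
To check how the two GPUs in a pod are actually connected, one option is to inspect the interconnect matrix reported by nvidia-smi; a minimal sketch, assuming nvidia-smi is available inside the pod:

import subprocess

# Print the GPU interconnect topology matrix. Links labelled NV1/NV2/... are NVLink;
# PIX/PXB/PHB/SYS mean the GPU-to-GPU traffic goes over PCIe and/or the host bridge.
topology = subprocess.run(
    ["nvidia-smi", "topo", "-m"],
    capture_output=True,
    text=True,
    check=True,
)
print(topology.stdout)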