Distributing a model across multiple GPUs using vLLM
vLLM has a TENSOR_PARALLEL_SIZE parameter to distribute a model across multiple GPUs, but is this parameter supported in the serverless vLLM template? I tried setting it, but the inference time was the same whether the model ran on a single GPU or on multiple GPUs.
7 Replies
cc: @Alpay Ariyak
You don't need to set it; it's automatically set to the number of GPUs on the worker.
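For context, here's a minimal sketch of how a worker like this might wire that up, assuming it reads a TENSOR_PARALLEL_SIZE env var and falls back to the visible GPU count (the model name is just an example; the actual worker internals may differ):

```python
# Sketch: map the TENSOR_PARALLEL_SIZE env var onto vLLM's engine argument,
# defaulting to every GPU the worker can see when the variable is unset.
import os

import torch
from vllm import LLM

tp_size = int(os.environ.get("TENSOR_PARALLEL_SIZE", torch.cuda.device_count()))

llm = LLM(
    model="mistralai/Mistral-7B-Instruct-v0.2",  # example model id, not from the thread
    tensor_parallel_size=tp_size,                # shards the model across tp_size GPUs
)
```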
Yeah, that's a vLLM issue; it doesn't allow 6 or 10 GPUs.
vLLM specifically requires that 64 / (GPU count) leave no remainder.
So: 1, 2, 4, 8, 16, 32, and 64.
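In other words, the GPU count has to divide the model's attention-head count evenly. A quick illustrative check, assuming a 64-head model (which is where the 1/2/4/8/16/32/64 list comes from):

```python
# Sketch of the divisibility rule: vLLM rejects any tensor_parallel_size
# that does not divide the attention-head count with no remainder.
NUM_ATTENTION_HEADS = 64  # assumption for illustration


def valid_tensor_parallel_sizes(num_heads: int, max_gpus: int = 64) -> list[int]:
    """Return GPU counts that divide the head count evenly."""
    return [n for n in range(1, max_gpus + 1) if num_heads % n == 0]


print(valid_tensor_parallel_sizes(NUM_ATTENTION_HEADS))
# [1, 2, 4, 8, 16, 32, 64] -- 6 and 10 are rejected, as noted above
```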
It does. I blame vLLM.