Runpod · 17mo ago
octopus

Distributing model across multiple GPUs using vLLM

vLLM has a parameter, TENSOR_PARALLEL_SIZE, to distribute a model across multiple GPUs, but is this parameter supported in the serverless vLLM template? I tried setting it, but the inference time was the same for the model running on a single GPU vs. multiple GPUs.
7 Replies
haris · 17mo ago
cc: @Alpay Ariyak
Alpay Ariyak · 17mo ago
You don't need it, as it's automatically set to the number of GPUs of the worker
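For reference, a minimal sketch of how tensor parallelism is configured when calling vLLM directly. The model name and the environment-variable fallback here are illustrative assumptions, not taken from the serverless template, which (as noted above) already derives the value from the worker's GPU count.

```python
import os

import torch
from vllm import LLM

# Assumption for illustration: read TENSOR_PARALLEL_SIZE if set, otherwise
# fall back to however many GPUs are visible on the worker.
tp_size = int(os.environ.get("TENSOR_PARALLEL_SIZE", torch.cuda.device_count() or 1))

# tensor_parallel_size shards the model's weights across tp_size GPUs.
# The model name below is just an example.
llm = LLM(
    model="facebook/opt-125m",
    tensor_parallel_size=tp_size,
)

outputs = llm.generate(["Why distribute a model across GPUs?"])
print(outputs[0].outputs[0].text)
```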
Unknown User · 17mo ago
(Message not public.)
Alpay Ariyak · 17mo ago
Yeah, that's a vLLM issue; it doesn't allow 6 or 10 GPUs.
Charixfox · 17mo ago
vLLM specifically says 64 / (GPU count) must leave no remainder, so the valid counts are 1, 2, 4, 8, 16, 32, and 64.
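As a quick sanity check, this snippet lists the GPU counts that divide 64 with no remainder, which matches the counts above and shows why 6 or 10 are rejected:

```python
# GPU counts for which 64 / count leaves no remainder.
valid_gpu_counts = [n for n in range(1, 65) if 64 % n == 0]
print(valid_gpu_counts)  # [1, 2, 4, 8, 16, 32, 64]
```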
Unknown User · 17mo ago
(Message not public.)
Charixfox · 17mo ago
It does. I blame vLLM.
