Distributing a model across multiple GPUs using vLLM
vLLM has a TENSOR_PARALLEL_SIZE parameter to distribute a model across multiple GPUs, but is this parameter supported in the serverless vLLM template? I tried setting it, but the inference time was the same whether the model ran on a single GPU or on multiple GPUs.
7 Replies
cc: @Alpay Ariyak
You don't need to set it; it's automatically set to the number of GPUs on the worker.
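For context, here's a minimal sketch of how a worker like this might wire that up, assuming it reads a TENSOR_PARALLEL_SIZE env var and falls back to the visible GPU count (the model name is just an example; the actual worker internals may differ):

```python
# Sketch: map the TENSOR_PARALLEL_SIZE env var onto vLLM's engine argument,
# defaulting to every GPU the worker can see when the variable is unset.
import os

import torch
from vllm import LLM

tp_size = int(os.environ.get("TENSOR_PARALLEL_SIZE", torch.cuda.device_count()))

llm = LLM(
    model="mistralai/Mistral-7B-Instruct-v0.2",  # example model id, not from the thread
    tensor_parallel_size=tp_size,                # shards the model across tp_size GPUs
)
```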
Yeah, that's a vLLM issue; it doesn't allow 6 or 10 GPUs.
vLLM specifically requires that 64 / (GPU count) leave no remainder.
So: 1, 2, 4, 8, 16, 32, and 64.
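In other words, the GPU count has to divide the model's attention-head count evenly. A quick illustrative check, assuming a 64-head model (which is where the 1/2/4/8/16/32/64 list comes from):

```python
# Sketch of the divisibility rule: vLLM rejects any tensor_parallel_size
# that does not divide the attention-head count with no remainder.
NUM_ATTENTION_HEADS = 64  # assumption for illustration


def valid_tensor_parallel_sizes(num_heads: int, max_gpus: int = 64) -> list[int]:
    """Return GPU counts that divide the head count evenly."""
    return [n for n in range(1, max_gpus + 1) if num_heads % n == 0]


print(valid_tensor_parallel_sizes(NUM_ATTENTION_HEADS))
# [1, 2, 4, 8, 16, 32, 64] -- 6 and 10 are rejected, as noted above
```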
It does. I blame vLLM.