vLLM and multiple GPUs
Hi, I am trying to deploy a 3B model (LLM) on RunPod with vLLM. I have tried different configurations (4xL4, 2xL40, etc.), but with all of them I get a CUDA out-of-memory error, as if the GPUs are not sharing memory. I have tried pipeline-parallel-size and tensor-parallel-size, but I still get the same error.
8 Replies
Some machines don't have the technology required for multi-GPU communication (NCCL) enabled, but the error for that should be pretty unambiguous. This is how I personally set up vLLM with multiple GPUs.
This uses the $RUNPOD_GPU_COUNT environment variable we set for you; it's the number of GPUs you selected. If it's not set (as it wouldn't be on your localhost), it just uses 1.
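For reference, something along these lines (a minimal sketch using vLLM's Python API and a placeholder 3B model name; the real deployment may use the OpenAI-compatible server instead):

```python
import os
from vllm import LLM, SamplingParams

# RunPod exposes the number of GPUs you selected as RUNPOD_GPU_COUNT.
# Fall back to 1 when it isn't set (e.g. when running on your localhost).
gpu_count = int(os.environ.get("RUNPOD_GPU_COUNT", "1"))

llm = LLM(
    model="meta-llama/Llama-3.2-3B-Instruct",  # placeholder; use your actual 3B model
    tensor_parallel_size=gpu_count,            # shard the model across all visible GPUs
)

outputs = llm.generate(["Hello!"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```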
Is there any way to know whether the machines have NCCL enabled?
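Not an official RunPod check, but assuming PyTorch is installed inside the pod, a quick sketch to see whether the build ships with NCCL:

```python
import torch
import torch.distributed as dist

# Whether this PyTorch build was compiled with NCCL support.
print("NCCL available:", dist.is_nccl_available())

# NCCL version the build links against, e.g. (2, 20, 5).
print("NCCL version:", torch.cuda.nccl.version())
```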
So I need to reserve, for example, 4xL4 and then check it? And if NCCL isn't enabled, should I contact the support team directly?
It would be nice to be informed about this before reserving.
Usually, if a model can fit on a single GPU, it’s best to use just one. Using multiple GPUs adds overhead for splitting and aggregating the workload.
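As a rough sketch of that idea (assuming the vLLM Python API; the model name and the REPLICA_GPU variable are placeholders): if the 3B model fits on one GPU, you can pin each replica to its own device and run several independent engines instead of one tensor-parallel engine, then load-balance requests across them.

```python
import os

# Pin this replica to a single GPU before vLLM/CUDA is initialized.
# Launch one such process per GPU (REPLICA_GPU=0, 1, 2, ...) and spread
# incoming requests across the replicas with a load balancer.
os.environ["CUDA_VISIBLE_DEVICES"] = os.environ.get("REPLICA_GPU", "0")

from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.2-3B-Instruct",  # placeholder; use your actual 3B model
    tensor_parallel_size=1,                    # whole model on one GPU, no sharding overhead
)
```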
And what do you recommend to reach at least 4,000 rpm?
Serving the model on separate, independent GPUs directly?