Runpod · 7mo ago
Teddy

vLLM and multiple GPUs

Hi, I am trying to deploy a 3B model (LLM) on Runpod with vLLM. I have tried different configurations (4xL4, 2xL40, etc.), but with all of them I get a CUDA out-of-memory error, as if the GPUs are not sharing memory. I have tried pipeline-parallel-size and tensor-parallel-size, but I still get the same error.
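For reference, the flags mentioned here map to vLLM's Python API roughly as follows. This is only a sketch; the model name, GPU counts, and memory fraction are placeholder assumptions, not values from the thread.

from vllm import LLM

llm = LLM(
    model="Qwen/Qwen2.5-3B-Instruct",  # placeholder ~3B model
    tensor_parallel_size=2,            # shard each layer across 2 GPUs on one node
    # pipeline_parallel_size=2,        # alternatively, split the layers into stages
    gpu_memory_utilization=0.90,       # lower this if startup fails with CUDA OOM
)
print(llm.generate(["Hello"])[0].outputs[0].text)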
8 Replies
Dj · 7mo ago
Some machines don't have the technology required for multi-GPU communication (NCCL) enabled, but the error for that should be super straightforward. This is how I personally set up vLLM with multiple GPUs:
import os
from vllm import LLM

model = LLM(
    model="mistralai/Ministral-8B-Instruct-2410",
    tokenizer_mode="mistral",
    config_format="mistral",
    load_format="mistral",
    # RUNPOD_GPU_COUNT is set by Runpod to the number of GPUs on the pod;
    # fall back to 1 if it isn't set (e.g. when running locally).
    tensor_parallel_size=int(os.environ.get("RUNPOD_GPU_COUNT", "1")),
)
This uses the $RUNPOD_GPU_COUNT variable we set for you; it's the number of GPUs you selected. If it's not set - like it wouldn't be on your localhost - it just uses 1.
Teddy (OP) · 7mo ago
Is there any way to know whether the machines have NCCL?
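One quick way to check from inside a running pod (a sketch using PyTorch, which vLLM already installs; this is not the content of the hidden reply below):

import torch
import torch.distributed as dist

# Reports whether this PyTorch build ships the NCCL backend and which version,
# plus whether peer-to-peer access works between the first two GPUs.
print("CUDA available:", torch.cuda.is_available())
print("GPUs visible:", torch.cuda.device_count())
print("NCCL available:", dist.is_nccl_available())
if dist.is_nccl_available():
    print("NCCL version:", torch.cuda.nccl.version())
if torch.cuda.device_count() >= 2:
    print("P2P between GPU 0 and 1:", torch.cuda.can_device_access_peer(0, 1))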
Unknown User · 7mo ago
Message not public.
Teddy (OP) · 7mo ago
So I need to reserve, for example, 4xL4 and then check it? And if it doesn't work, should I contact the support team directly? It would be nice to know this before reserving.
Unknown User · 7mo ago
Message not public.
yhlong00000 · 7mo ago
Usually, if a model can fit on a single GPU, it’s best to use just one. Using multiple GPUs adds overhead for splitting and aggregating the workload.
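As a rough back-of-the-envelope check (my own numbers, not from the thread): a 3B-parameter model in fp16/bf16 needs about 3e9 × 2 bytes ≈ 6 GB just for the weights, which fits easily on a single 24 GB L4 with room left for the KV cache.

# Illustrative estimate of weight memory for the 3B model in the question.
params = 3e9           # parameters
bytes_per_param = 2    # fp16 / bf16
weights_gib = params * bytes_per_param / 1024**3
print(f"~{weights_gib:.1f} GiB of weights")  # ~5.6 GiB, well under an L4's 24 GB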
Teddy (OP) · 7mo ago
And what do you recommend for reaching at least 4,000 rpm? Serving the model on several independent GPUs directly?
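One common pattern for the "independent GPUs" idea is data parallelism: one vLLM server per GPU with a load balancer in front. The sketch below only illustrates that idea, not a recommendation from the thread; the model name and ports are placeholders.

import os
import subprocess

MODEL = "Qwen/Qwen2.5-3B-Instruct"  # placeholder model name
NUM_GPUS = int(os.environ.get("RUNPOD_GPU_COUNT", "1"))

# Start one independent vLLM server per GPU, each pinned to a single device.
# A reverse proxy or round-robin client would spread requests across the ports.
procs = []
for gpu in range(NUM_GPUS):
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu))
    procs.append(subprocess.Popen(
        ["vllm", "serve", MODEL, "--port", str(8000 + gpu)],
        env=env,
    ))

for p in procs:
    p.wait()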
Unknown User · 7mo ago
Message not public.
