Runpod · 7mo ago
Teddy

vLLM and multiple GPUs

Hi, I am trying to deploy a 3B model (LLM) on Runpod with vLLM. I have tried different configurations (4xL4, 2xL40, etc.), but with all of them I get a CUDA out-of-memory error, as if the GPUs are not sharing memory. I have tried pipeline-parallel-size and tensor-parallel-size, but I still get the same error.
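For reference, the flags mentioned here map to vLLM's Python API roughly as follows. This is only a sketch; the model name, GPU counts, and memory fraction are placeholder assumptions, not values from the thread.

from vllm import LLM

llm = LLM(
    model="Qwen/Qwen2.5-3B-Instruct",  # placeholder ~3B model
    tensor_parallel_size=2,            # shard each layer across 2 GPUs on one node
    # pipeline_parallel_size=2,        # alternatively, split the layers into stages
    gpu_memory_utilization=0.90,       # lower this if startup fails with CUDA OOM
)
print(llm.generate(["Hello"])[0].outputs[0].text)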
8 Replies
Dj · 7mo ago
Some machines don't have the technology required for multi-GPU communication (NCCL) enabled, but the error for that should be super straightforward. This is how I personally set up vLLM with multiple GPUs:
import os
from vllm import LLM

model = LLM(
    model="mistralai/Ministral-8B-Instruct-2410",
    tokenizer_mode="mistral",
    config_format="mistral",
    load_format="mistral",
    # RUNPOD_GPU_COUNT is set by Runpod to the number of GPUs on the pod;
    # fall back to 1 if it isn't set (e.g. when running locally).
    tensor_parallel_size=int(os.environ.get("RUNPOD_GPU_COUNT", "1")),
)
This uses the $RUNPOD_GPU_COUNT variable we set for you; it's the number of GPUs you selected. If it's not set - like it wouldn't be on your localhost - it just uses 1.
Teddy (OP) · 7mo ago
Is there any way to know whether the machines have NCCL?
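One quick way to check from inside a running pod (a sketch using PyTorch, which vLLM already installs; this is not the content of the hidden reply below):

import torch
import torch.distributed as dist

# Reports whether this PyTorch build ships the NCCL backend and which version,
# plus whether peer-to-peer access works between the first two GPUs.
print("CUDA available:", torch.cuda.is_available())
print("GPUs visible:", torch.cuda.device_count())
print("NCCL available:", dist.is_nccl_available())
if dist.is_nccl_available():
    print("NCCL version:", torch.cuda.nccl.version())
if torch.cuda.device_count() >= 2:
    print("P2P between GPU 0 and 1:", torch.cuda.can_device_access_peer(0, 1))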
Unknown User · 7mo ago
Message not public.
Teddy (OP) · 7mo ago
So I need to reserve, for example, 4xL4 and then check it? And if it doesn't work, should I contact the support team directly? It would be nice to know this before reserving.
Unknown User · 7mo ago
Message not public.
yhlong00000 · 7mo ago
Usually, if a model can fit on a single GPU, it’s best to use just one. Using multiple GPUs adds overhead for splitting and aggregating the workload.
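As a rough back-of-the-envelope check (my own numbers, not from the thread): a 3B-parameter model in fp16/bf16 needs about 3e9 × 2 bytes ≈ 6 GB just for the weights, which fits easily on a single 24 GB L4 with room left for the KV cache.

# Illustrative estimate of weight memory for the 3B model in the question.
params = 3e9           # parameters
bytes_per_param = 2    # fp16 / bf16
weights_gib = params * bytes_per_param / 1024**3
print(f"~{weights_gib:.1f} GiB of weights")  # ~5.6 GiB, well under an L4's 24 GB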
Teddy (OP) · 7mo ago
And what do you recommend for reaching at least 4,000 rpm? Serving the model on several independent GPUs directly?
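One common pattern for the "independent GPUs" idea is data parallelism: one vLLM server per GPU with a load balancer in front. The sketch below only illustrates that idea, not a recommendation from the thread; the model name and ports are placeholders.

import os
import subprocess

MODEL = "Qwen/Qwen2.5-3B-Instruct"  # placeholder model name
NUM_GPUS = int(os.environ.get("RUNPOD_GPU_COUNT", "1"))

# Start one independent vLLM server per GPU, each pinned to a single device.
# A reverse proxy or round-robin client would spread requests across the ports.
procs = []
for gpu in range(NUM_GPUS):
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu))
    procs.append(subprocess.Popen(
        ["vllm", "serve", MODEL, "--port", str(8000 + gpu)],
        env=env,
    ))

for p in procs:
    p.wait()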
Unknown User · 7mo ago
Message not public.
