SGLang DeepSeek-V3-0324

I have been trying to run DeepSeek-V3-0324 on Instant Clusters with 2 x (8 x H100) and have so far been unsuccessful. I am trying to get the model to run multi-node + multi-GPU. I downloaded the model from Hugging Face onto a persistent volume, which I attach to my Instant Cluster before launching. After launching, I run the PyTorch demo script from https://docs.runpod.io/instant-clusters/pytorch to confirm that the network is working (it is). I then follow the instructions for running DeepSeek-V3-0324 from https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3 rather than following the absolute default instructions, which are:
# node 1
python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 16 --dist-init-addr 10.0.0.1:5000 --nnodes 2 --node-rank 0 --trust-remote-code

# node 2
python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 16 --dist-init-addr 10.0.0.1:5000 --nnodes 2 --node-rank 1 --trust-remote-code
In its place, I run the following command on each node:
python3 -m sglang.launch_server --model-path DeepSeek-V3-0324 --tp 16 --dist-init-addr ${MASTER_ADDR}:${MASTER_PORT} --nnodes ${NUM_NODES} --node-rank ${NODE_RANK} --trust-remote-code
The issue is that this hangs. Watching nvidia-smi, each GPU only ever loads up to almost 1 GB of memory and then goes no further. Any help would be greatly appreciated.
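For debugging a hang like this, a minimal sanity check is to re-run the same launch with the collective backends pinned to the private interface and verbose NCCL logging enabled, so the logs show which rank or transport stalls. This is a sketch; eth1 as the name of RunPod's private NIC is an assumption, not something verified here:

# same launch command, with debug env vars (eth1 is an assumed interface name)
export NCCL_SOCKET_IFNAME=eth1
export GLOO_SOCKET_IFNAME=eth1
export NCCL_DEBUG=INFO
python3 -m sglang.launch_server --model-path DeepSeek-V3-0324 --tp 16 \
  --dist-init-addr ${MASTER_ADDR}:${MASTER_PORT} --nnodes ${NUM_NODES} \
  --node-rank ${NODE_RANK} --trust-remote-code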
riverfog7 · 7mo ago
Before discussing this problem: does a 600B+ parameter FP16 model even fit in 16x H100s with a reasonable context length? Or is it FP8? I don't know. Anyway, why are you doing tensor parallelism over the network?
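Rough math on that question: DeepSeek-V3 has about 671B parameters, so the weights alone are roughly 671 GB at FP8 (1 byte per parameter) or about 1.34 TB at FP16 (2 bytes per parameter). 16 x H100 80GB gives 1,280 GB of total VRAM, so FP8 fits with headroom for the KV cache, while FP16 does not.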
Unknown User · 7mo ago (message not public)
riverfog7 · 7mo ago
It's tensor parallel though. Over a network you should use pipeline parallelism. Wow, you're rich lol. Apparently it's FP8 in the official repo, so it should work.
riverfog7 · 7mo ago
Tensor Parallelism - NADDOD Blog
Tensor parallelism alleviates memory issues in large-scale training. RoCE enables efficient communication for GPU tensor parallelism, accelerating computations.
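The usual pattern behind this advice: keep tensor parallelism inside each node (over NVLink) and use pipeline parallelism across nodes (over Ethernet/InfiniBand). As a sketch of how that maps onto flags, using the vLLM command that comes up later in this thread:

# 8-way TP within each node, 2 pipeline stages across the 2 nodes
vllm serve /path/to/DeepSeek-V3-0324 \
  --tensor-parallel-size 8 \
  --pipeline-parallel-size 2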
riverfog7 · 7mo ago
Maybe it's because it's the container CMD. Try bash -c 'command'.
Unknown User · 7mo ago (message not public)
frogsbody (OP) · 7mo ago
Yeah, I just want to run this thing. I'm happy to spend on GPUs for a period of time to get it running, but I can't even get the basics to work, unfortunately... has anyone seen any example, on any infrastructure setup, of this working multi-node / pipeline parallel? If not on RunPod, then anywhere else? It seems that no one has got this running anywhere.
Unknown User · 7mo ago (message not public)
Poddy · 7mo ago
@frogsbody
Escalated To Zendesk
The thread has been escalated to Zendesk!
Unknown User · 7mo ago (message not public)
frogsbody (OP) · 7mo ago
Yes, I have exactly this issue.
frogsbody (OP) · 7mo ago
They claim to support pipeline parallelism:
[image]
frogsbody (OP) · 7mo ago
GitHub: sglang/benchmark/deepseek_v3 at main · sgl-project/sglang
SGLang is a fast serving framework for large language models and vision language models. - sgl-project/sglang
Unknown User · 7mo ago (message not public)
frogsbody (OP) · 7mo ago
nnodes = 2. It's a torchrun argument that gets passed through to the equivalent in SGLang, I believe.
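For comparison, a sketch of the torchrun analogy being described here (this is the standard torchrun multi-node layout, not taken from SGLang's docs; train.py is a placeholder script name). SGLang's --nnodes/--node-rank/--dist-init-addr play the same roles as the flags below:

# node 1 (rank 0) of a 2-node, 8-GPU-per-node job
torchrun --nnodes 2 --node_rank 0 --nproc_per_node 8 \
  --rdzv_backend c10d --rdzv_endpoint ${MASTER_ADDR}:${MASTER_PORT} train.py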
Unknown User · 7mo ago (message not public)
frogsbody (OP) · 7mo ago
@Jason have you tried vLLM with Instant Clusters? I believe the communication mechanism under the hood doesn't work with the way that RunPod sets up inter-node communication. I couldn't get it to work (this was a few weeks ago, when it was still in beta, though). I wasn't sure where to open a ticket because I'm not sure where the error is really coming from... I think it's an SGLang issue, but I wasn't clear.
riverfog7 · 7mo ago
Are you still here? I've never used SGLang, but vLLM pipeline parallel works well multi-node, even with not-that-good network bandwidth.
frogsbody (OP) · 7mo ago
I'm still trying @riverfog7. Have you tested vLLM with Instant Clusters, or do you have another solution where I can test multi-GPU in the cloud to run this?
riverfog7 · 7mo ago
@frogsbody but do you really need multi-GPU?
frogsbody (OP) · 7mo ago
Yeah, I specifically need to test tensor parallelism and pipeline parallelism: 2 nodes of 8 x H100
riverfog7 · 7mo ago
I mean, you can host the same model on 1 node. You want DeepSeek V3 at FP8, right?
frogsbody (OP) · 7mo ago
My requirements are to run DeepSeek-V3-0324 over two nodes by whatever means - I just have to see that pipeline and tensor parallelism can work for the model.
riverfog7 · 7mo ago
Okay, is SGLang required too?
frogsbody (OP) · 7mo ago
It's less about actually using it, more about showing it can work. No, it can be anything; vLLM would be fine too.
riverfog7 · 7mo ago
vLLM should work, I've done it in the past.
frogsbody (OP) · 7mo ago
Have you got that working in RunPod Instant Clusters?
riverfog7 · 7mo ago
Not with 2x8, but that doesn't matter. No, on AWS. It should work with RunPod though.
frogsbody (OP) · 7mo ago
I had trouble running the basic torchrun script: https://docs.runpod.io/instant-clusters/pytorch
frogsbody (OP) · 7mo ago
It failed to run when I tried with vLLM. Yeah, I tried AWS, but they wouldn't give me any GPUs, so now I'm just trying with RunPod. I will try vLLM again.
riverfog7 · 7mo ago
Can you ping the other pod?
frogsbody (OP) · 7mo ago
Yeah I can ping it
riverfog7 · 7mo ago
With the IP?
frogsbody (OP) · 7mo ago
It's something to do with Ray, which vLLM uses under the hood
riverfog7 · 7mo ago
Do you use the vLLM Docker image or something else?
frogsbody (OP) · 7mo ago
I was using something else - but I can use that Docker image.
riverfog7 · 7mo ago
I succeeded with the Docker image, so...
frogsbody (OP) · 7mo ago
Thanks for letting me know, I'll try that out and let you know how it goes
riverfog7 · 7mo ago
The cluster-making script. This is the docker run command, so you can modify this and run it:
docker run \
  --entrypoint /bin/bash \
  --network host \
  --name node \
  --shm-size 10.24g \
  --gpus all \
  -v "${PATH_TO_HF_HOME}:/root/.cache/huggingface" \
  "${ADDITIONAL_ARGS[@]}" \
  "${DOCKER_IMAGE}" -c "${RAY_START_CMD}"
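For context, this looks like vLLM's run_cluster.sh helper; assuming that's the case, RAY_START_CMD is the only thing that differs between the two machines. A sketch, with HEAD_NODE_IP as a placeholder:

# on the head node
RAY_START_CMD="ray start --block --head --port=6379"
# on each worker node, pointing at the head's IP
RAY_START_CMD="ray start --block --address=${HEAD_NODE_IP}:6379"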
frogsbody (OP) · 7mo ago
Where do I run this? When I create the pod?
riverfog7 · 7mo ago
No. What the docs describe is: you have two physical machines, you create a Ray container on both, and they form a cluster. But in your case you have no access to the physical machines.
frogsbody (OP) · 7mo ago
Yeah, the issue I had is that RunPod doesn't have that. I have to work within those bounds; I don't have AWS or anything else to work with.
riverfog7 · 7mo ago
So you should translate the docker run command into RunPod's template:
docker run --entrypoint /bin/bash --network host --name node --shm-size 10.24g --gpus all \
  -v /path/to/the/huggingface/home/in/this/node:/root/.cache/huggingface \
  -e VLLM_HOST_IP=ip_of_this_node \
  vllm/vllm-openai -c "ray start --block --address=ip_of_head_node:6379"
frogsbody (OP) · 7mo ago
Yeah, this was my next idea - I just have to figure out how to get RunPod to work with this, since docker can't be run inside a pod once I start it. It has to be part of a template or something. I'm pretty new to this side of RunPod.
riverfog7 · 7mo ago
Should be:
For the worker -
image name: vllm/vllm-openai
CMD: python3 -m vllm.entrypoints.openai.api_server -c ray start --block --address=ip_of_head_node:6379
mount the network volume to ~/.cache/
For the head -
image name: vllm/vllm-openai
CMD: python3 -m vllm.entrypoints.openai.api_server -c ray start --block --head --port=6379
env: VLLM_HOST_IP=ip_of_this_node
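As the thread discovers further down, mixing the api_server module with a ray start command like this makes the api_server try to parse Ray's flags (the --block-size error below). A sketch of a cleaner split, assuming the vllm/vllm-openai image with its entrypoint overridden to bash, and placeholder IPs/paths:

# worker template CMD (sketch): just join the Ray cluster
bash -c "ray start --block --address=ip_of_head_node:6379"
# head template CMD (sketch): start Ray, then launch the server on top of it
bash -c "ray start --head --port=6379 && vllm serve /path/to/model --tensor-parallel-size 8 --pipeline-parallel-size 2"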
frogsbody (OP) · 7mo ago
Are you starting these as two separate pods, or using Instant Clusters? In this case it looks like you're using two separate pods with global networking or something.
riverfog7 · 7mo ago
Can you apply different images?
frogsbody (OP) · 7mo ago
Not with Instant Clusters.
riverfog7 · 7mo ago
For the two pods in the cluster? Or different settings, at least?
frogsbody (OP) · 7mo ago
[image]
frogsbody (OP) · 7mo ago
Doesn't look like it.
riverfog7 · 7mo ago
Lol, should we write a script? It's solvable.
frogsbody (OP) · 7mo ago
Yeah, I'd love to lol, been trying to run DeepSeek V3 across two nodes for a while now.
vllm serve /path/to/the/model/in/the/container \
  --tensor-parallel-size 8 \
  --pipeline-parallel-size 2
I was thinking this should just work: if I spin up a cluster, go into each node and run this, and make sure to pass in the right information for the host... but then it fails because of Ray.
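That intuition is close; the missing piece is that for multi-node vLLM the Ray cluster has to exist first, and vllm serve is then run once, on the head only. A sketch of the full order of operations (the IPs here are placeholders in the style of this thread):

# 1. head node
ray start --head --port=6379
# 2. worker node, pointing at the head's private IP
ray start --address=10.65.0.2:6379
# 3. confirm both nodes (16 GPUs) joined before serving
ray status
# 4. on the head only: vLLM schedules workers across the Ray cluster
vllm serve /path/to/the/model/in/the/container \
  --tensor-parallel-size 8 \
  --pipeline-parallel-size 2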
riverfog7 · 7mo ago
Yeah, but we need a script to build the Ray cluster first. It should run INSIDE a Ray cluster.
frogsbody (OP) · 7mo ago
Yeah, I feel like that's outside the default scope of Instant Clusters. Are you suggesting we set up a Ray cluster inside of our non-Ray cluster?
riverfog7 · 7mo ago
That's what I meant 😄 the writing-a-script part was for that.
frogsbody (OP) · 7mo ago
That would be cool... solve a lot of problems lol
riverfog7 · 7mo ago
I'll try with global networking first, just to see if it works.
frogsbody (OP) · 7mo ago
I'll try to form a Ray cluster in Instant Clusters again.
riverfog7 · 7mo ago
Try this inside a vLLM container:
python3 -m vllm.entrypoints.openai.api_server -c ray start --block --address=ip_of_head_node:6379
frogsbody (OP) · 7mo ago
Sure will try now
riverfog7 · 7mo ago
So you have separate SSH access to the two nodes, right?
frogsbody (OP) · 7mo ago
Yes
riverfog7 · 7mo ago
good
frogsbody (OP) · 7mo ago
Will send a screenshot in a sec.
api_server.py: error: argument --block-size: expected one argument
riverfog7 · 7mo ago
Isn't it --block?
frogsbody (OP) · 7mo ago
I copied what you sent and that's what it gave me. In my case:
python3 -m vllm.entrypoints.openai.api_server -c ray start --block --address=10.65.0.2:6379
riverfog7 · 7mo ago
This?
python3 -m vllm.entrypoints.openai.api_server -c ray start --block --address=ip_of_head_node:6379
frogsbody (OP) · 7mo ago
Yeah, that's what I did ^
riverfog7 · 7mo ago
Hmm, the image is vllm/vllm-openai.
frogsbody (OP) · 7mo ago
Yep
riverfog7 · 7mo ago
I think I'll test on mine first, wait a sec. Oh, it was just ray start --block --address=10.65.0.2:6379, or bash -c "ray start ...."
frogsbody (OP) · 7mo ago
Yeah, I'm doing that right now, actually.
riverfog7 · 7mo ago
Didn't see the --entrypoint /bin/bash.
frogsbody (OP) · 7mo ago
But I can't get the worker to connect.
riverfog7 · 7mo ago
xD
Unknown User · 7mo ago (message not public)
frogsbody (OP) · 7mo ago
[image]
riverfog7 · 7mo ago
wdym?
Unknown User · 7mo ago (message not public)
frogsbody (OP) · 7mo ago
Tried 6379 and couldn't get that to work
Unknown User · 7mo ago (message not public)
frogsbody (OP) · 7mo ago
So I tried a port I knew was exposed, 29400, since I can ping between the nodes with that.
riverfog7 · 7mo ago
This?
ray start --block --address=10.65.0.2:6379
frogsbody (OP) · 7mo ago
This works
riverfog7 · 7mo ago
Does adding --block make a difference?
frogsbody (OP) · 7mo ago
But connecting from the worker doesn't. Let me try with --block.
riverfog7 · 7mo ago
ray start --block --head --port=6379 for the head
ray start --block --address=10.65.0.2:6379 for the worker
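Assuming the cluster forms, a quick way to verify it before blaming vLLM is the standard Ray CLI, run on either node:

# should list 2 nodes and 16 GPUs once the worker has joined
ray status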
frogsbody (OP) · 7mo ago
[image]
riverfog7 · 7mo ago
Maybe it doesn't work because you already started a vLLM process in the start command.
frogsbody (OP) · 7mo ago
I actually haven't started anything in this one. I'm not using the vLLM image this time; I started a new cluster that doesn't have vLLM and pip installed it. Regardless, Ray should work independently.
riverfog7 · 7mo ago
Yeah, same thought, but I had nothing to blame other than that. Check ufw, just in case.
frogsbody (OP) · 7mo ago
What is UFW?
riverfog7 · 7mo ago
The Ubuntu firewall. And check other firewalls too.
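A couple of standard checks for that (stock Ubuntu commands, nothing RunPod-specific):

# is ufw even active, and what does it allow?
ufw status verbose
# any other iptables rules that could drop inter-node traffic?
iptables -L -n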
frogsbody (OP) · 7mo ago
Ah okay, let me check.
riverfog7 · 7mo ago
That was the problem in my last attempt. @frogsbody I have one question.
frogsbody (OP) · 7mo ago
Trying to check, but I have to install packages. @riverfog7 yeah, what's up?
riverfog7 · 7mo ago
Why does the first pic say 172.xx but the second pic says 10.60.something?
frogsbody (OP) · 7mo ago
That's the "master address"
riverfog7 · 7mo ago
master addr?
frogsbody (OP) · 7mo ago
[image]
frogsbody (OP) · 7mo ago
Overview | RunPod Documentation
Instant Clusters enable high-performance computing across multiple GPUs with high-speed networking capabilities.
frogsbody (OP) · 7mo ago
NODE_ADDR is the address of the individual node; that's the one that Ray uses. vLLM uses Ray under the hood and it isn't playing nicely. That's why I was hoping SGLang would work, since it uses PyTorch. But then we have that weird bug where model loading hangs at 1% lol. I suspect that it's actually only loading the PyTorch stuff and never actually loads any of the weights in.
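If the wrong-interface theory below is right, the usual knobs are the socket-interface environment variables (VLLM_HOST_IP, NCCL_SOCKET_IFNAME, and GLOO_SOCKET_IFNAME are real vLLM/NCCL/Gloo settings; eth1 as RunPod's private NIC is an assumption). A sketch, set on each node before launching:

# pin Ray/vLLM and the collective backends to the private interface
export VLLM_HOST_IP=10.65.0.2      # this node's own private 10.x address
export NCCL_SOCKET_IFNAME=eth1
export GLOO_SOCKET_IFNAME=eth1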
riverfog7 · 7mo ago
Maybe it binds to the wrong NIC?
frogsbody (OP) · 7mo ago
We use eth1 here, I think.
riverfog7 · 7mo ago
And receives from the public IP but not from the private IP.
frogsbody (OP) · 7mo ago
The issue is that I'm not sure if that's something we can even fix under the hood with vLLM... I just don't know enough about how vLLM works.
riverfog7 · 7mo ago
If Ray works, vLLM works.
frogsbody (OP) · 7mo ago
And vLLM uses that same Ray cluster?
riverfog7 · 7mo ago
yeah
frogsbody (OP) · 7mo ago
Hmm
riverfog7 · 7mo ago
You can just use the Ray cluster as one computer; vLLM does the finicky things by itself.
frogsbody (OP) · 7mo ago
I actually can't even ping between the machines now.
riverfog7 · 7mo ago
Maybe ray start --block --head --port 6379 --node-ip-address 10.65.0.2 on the head? Wut?
frogsbody (OP) · 7mo ago
[image]
riverfog7 · 7mo ago
Can you just ping it? ping 10.something
frogsbody (OP) · 7mo ago
[image]
riverfog7 · 7mo ago
ufw status?
frogsbody (OP) · 7mo ago
Interesting. My environment is messed up now; I can't run the default torch script here.
frogsbody (OP) · 7mo ago
Deploy with PyTorch | RunPod Documentation
Learn how to deploy an Instant Cluster and run a multi-node process using PyTorch.
riverfog7 · 7mo ago
lol
frogsbody (OP) · 7mo ago
So I messed something up with whatever we tried
riverfog7 · 7mo ago
Hmm, maybe start with a fresh PyTorch image.
frogsbody (OP) · 7mo ago
I'
riverfog7 · 7mo ago
And install everything (Ray and vLLM).
frogsbody (OP) · 7mo ago
I'm going to have to refund this account lol, it won't let me start another pod. Not enough money in the account lol. I may sleep for a bit and get back to this; interesting problem to solve.
riverfog7 · 7mo ago
Great. The community says providing --node-ip-address should make it bind to the proper address, so maybe try that next time on the head node.
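Putting the thread's findings together, a sketch of that next attempt (--node-ip-address is a real ray start flag; 10.65.0.2 is the head IP from earlier, and the worker's 10.65.0.3 is a placeholder):

# head, bound explicitly to the private NIC
ray start --block --head --port=6379 --node-ip-address=10.65.0.2
# worker, also pinned to its own private address
ray start --block --address=10.65.0.2:6379 --node-ip-address=10.65.0.3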
frogsbody (OP) · 7mo ago
Yeah will do, I'll post any findings here
