Error building worker-vllm docker image for mixtral 8x7b
I'm running the following command to build and tag a Docker worker image based on worker-vllm:
docker build -t lesterhnh/mixtral-8x7b-instruct-v0.1-runpod-serverless:1.0 --build-arg MODEL_NAME="mistralai/Mixtral-8x7B-Instruct-v0.1" --build-arg MODEL_BASE_PATH="/models" .
I'm getting the following error:
------
Dockerfile:23
--------------------
22 | # Install torch and vllm based on CUDA version
23 | >>> RUN if [[ "${WORKER_CUDA_VERSION}" == 11.8* ]]; then \
24 | >>> python3.11 -m pip install -U --force-reinstall torch==2.1.2 xformers==0.0.23.post1 --index-url https://download.pytorch.org/whl/cu118; \
25 | >>> python3.11 -m pip install -e git+https://github.com/runpod/vllm-fork-for-sls-worker.git@cuda-11.8#egg=vllm; \
26 | >>> else \
27 | >>> python3.11 -m pip install -e git+https://github.com/runpod/vllm-fork-for-sls-worker.git#egg=vllm; \
28 | >>> fi && \
29 | >>> rm -rf /root/.cache/pip
30 |
--------------------
ERROR: failed to solve: process "/bin/bash -o pipefail -c if [[ "${WORKER_CUDA_VERSION}" == 11.8* ]]; then python3.11 -m pip install -U --force-reinstall torch==2.1.2 xformers==0.0.23.post1 --index-url https://download.pytorch.org/whl/cu118; python3.11 -m pip install -e git+https://github.com/runpod/vllm-fork-for-sls-worker.git@cuda-11.8#egg=vllm; else python3.11 -m pip install -e git+https://github.com/runpod/vllm-fork-for-sls-worker.git#egg=vllm; fi && rm -rf /root/.cache/pip" did not complete successfully: exit code: 1
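In case it helps with diagnosing this, here is how I plan to re-run the build to capture the full pip output from the failing RUN step, since the summary above truncates it. Note that the WORKER_CUDA_VERSION=12.1 build arg is an assumption on my part (the Dockerfile branches on that variable, so I'm guessing it can be overridden at build time); the --progress=plain flag just makes BuildKit print the complete step output.

# Same build, but with plain progress output so the pip error is visible,
# and the log saved to a file for sharing.
# WORKER_CUDA_VERSION=12.1 is an assumed build arg; adjust to your base image's CUDA version.
docker build -t lesterhnh/mixtral-8x7b-instruct-v0.1-runpod-serverless:1.0 \
  --progress=plain \
  --build-arg MODEL_NAME="mistralai/Mixtral-8x7B-Instruct-v0.1" \
  --build-arg MODEL_BASE_PATH="/models" \
  --build-arg WORKER_CUDA_VERSION=12.1 \
  . 2>&1 | tee build.log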
