Error building worker-vllm Docker image for Mixtral 8x7B

I'm running the following command to build and tag a Docker worker image based off of worker-vllm:

docker build -t lesterhnh/mixtral-8x7b-instruct-v0.1-runpod-serverless:1.0 --build-arg MODEL_NAME="mistralai/Mixtral-8x7B-Instruct-v0.1" --build-arg MODEL_BASE_PATH="/models" .

I'm getting the following error:

------
Dockerfile:23
--------------------
22 | # Install torch and vllm based on CUDA version
23 | >>> RUN if [[ "${WORKER_CUDA_VERSION}" == 11.8* ]]; then \
24 | >>> python3.11 -m pip install -U --force-reinstall torch==2.1.2 xformers==0.0.23.post1 --index-url https://download.pytorch.org/whl/cu118; \
25 | >>> python3.11 -m pip install -e git+https://github.com/runpod/vllm-fork-for-sls-worker.git@cuda-11.8#egg=vllm; \
26 | >>> else \
27 | >>> python3.11 -m pip install -e git+https://github.com/runpod/vllm-fork-for-sls-worker.git#egg=vllm; \
28 | >>> fi && \
29 | >>> rm -rf /root/.cache/pip
30 |
--------------------
ERROR: failed to solve: process "/bin/bash -o pipefail -c if [[ "${WORKER_CUDA_VERSION}" == 11.8* ]]; then python3.11 -m pip install -U --force-reinstall torch==2.1.2 xformers==0.0.23.post1 --index-url https://download.pytorch.org/whl/cu118; python3.11 -m pip install -e git+https://github.com/runpod/vllm-fork-for-sls-worker.git@cuda-11.8#egg=vllm; else python3.11 -m pip install -e git+https://github.com/runpod/vllm-fork-for-sls-worker.git#egg=vllm; fi && rm -rf /root/.cache/pip" did not complete successfully: exit code: 1
Justin Merrell
Justin Merrell5mo ago
Are you building this on a system that has a GPU? cc: @Alpay Ariyak
wizardjoe
wizardjoe5mo ago
Yes - building it on a Windows PC with a 4090. I'm running the command on WSL (Windows Subsystem for Linux), if that helps. Just tried it on the regular command prompt, and I can confirm that I'm getting the same error.
vaventt
vaventt5mo ago
Hi Justin, can you help me with a few questions? I need to develop and deploy a RAG system based on an open-source LLM. I have tried several times on RunPod Serverless (A6000/A100): it starts the worker and container, then downloads 50GB or 150GB of weights, but never all 270GB. It just stops and restarts the download again and again, burning money with no real outcome. I just can't deploy LLaMA-70B; RunPod doesn't give me a chance. What should I do? Is the Cloud GPU option more suitable and stable for production than Serverless?
Justin Merrell
Justin Merrell5mo ago
If you are downloading weights before the handler starts, then your worker is timing out and being removed. Ideally you will want to either have the weights stored in network storage or have them baked into the worker image.
vaventt
vaventt5mo ago
I have only one rp_handler.py file where all the code is located. It starts after the last command in the Docker container, and once the handler function is triggered, my Hugging Face weights start downloading on line 35:

AutoModelForCausalLM.from_pretrained(
    model_path,
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.float16,
    quantization_config=quantization_config
)

Sometimes it works on the first try, but specifically with the big LLaMA, whose weights take up to an hour to download in my case, the container stops without throwing an error while there is still 1 job in the queue. The small Mistral-7B usually works great, but when I take a bigger model it just doesn't work. So network storage for my workers would definitely help and is good deployment practice, in terms of using RunPod?
Justin Merrell
Justin Merrell5mo ago
Yes, network storage sounds like what you are missing
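For reference, a minimal sketch of what that could look like inside a custom handler, assuming the endpoint has a network volume mounted at /runpod-volume; the cache path and model ID are illustrative, not the exact code from this thread:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed network-volume mount; weights cached here survive worker restarts
CACHE_DIR = "/runpod-volume/huggingface-cache/hub"
MODEL_ID = "mistralai/Mistral-7B-Instruct-v0.1"  # placeholder model

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, cache_dir=CACHE_DIR)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    cache_dir=CACHE_DIR,  # download once, reuse on later cold starts
    device_map="auto",
    torch_dtype=torch.float16,
)

With the cache on the volume, weights downloaded by one worker are reused by the next instead of being re-downloaded on every cold start.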
wizardjoe
wizardjoe5mo ago
@Justin, any idea on the error I'm getting?
vaventt
vaventt5mo ago
Great, thanks a lot. By the way, I'm located in Eastern Europe; how do I choose the best region for my network storage? By distance, EU-RO-1 and EU-CZ-1 should be the closest, but maybe some regions have more GPUs in general to choose from and work with?
Alpay Ariyak
Alpay Ariyak5mo ago
Will check on this in a few hours
wizardjoe
wizardjoe5mo ago
Some more detail - it looks like it fails when trying to run setup.py develop for vllm. Ninja is trying to compile and fails.
wizardjoe
wizardjoe5mo ago
Also confirming that I'm getting the same error when trying to build Llama2-13b. I was able to build and deploy Llama2-13b a month ago, so something must have changed since then. I also noticed that GitHub is showing the builds failing on CD.
Herai_Studios
Herai_Studios5mo ago
@Alpay Ariyak @Justin If you guys have any update on this, I would be interested in knowing the outcome as I am facing this issue as well
wizardjoe
wizardjoe5mo ago
For what it's worth, I can't even get it to work using the pre-built Docker image with environment variables. When I use this method to spin up an endpoint, I'm getting CUDA out of memory errors, even though I selected a 48GB GPU
Herai_Studios
Herai_Studios5mo ago
Just to be clear - it doesn't let you get past the setup.py script for vllm, correct? This is where it breaks for me with the pre-built Dockerfile as well. I built my own where I just added vllm to the requirements.txt file, and that worked better. Or you can do RUN pip install vllm.
wizardjoe
wizardjoe5mo ago
Yup, that's where it breaks for me too. Did you have to specify a specific version for vllm?
ashleyk
ashleyk5mo ago
Don't think you can build it on GitHub, because I believe it now requires the machine you're building on to have a GPU.
Herai_Studios
Herai_Studios5mo ago
@ashleyk if I'm not mistaken, @wizardjoe mentioned his machine has a GPU and mine also does. The docker image is not working correctly for either of us though and it's breaking at the same point
Alpay Ariyak
Alpay Ariyak5mo ago
Working on this
wizardjoe
wizardjoe5mo ago
@Herai_Studios @ashleyk Yes, I have a 4090
Alpay Ariyak
Alpay Ariyak5mo ago
To confirm, you’re also not able to just sudo docker build . ?
wizardjoe
wizardjoe5mo ago
@Alpay Ariyak you mean, "sudo docker build ." without tagging and any of the other args?
Alpay Ariyak
Alpay Ariyak5mo ago
Yeah, just without args. It seems to me that the general issue is not being on something like Ubuntu, because I've never had issues building it from different Ubuntu machines. We're working on a way to allow building with any OS; vLLM's recent updates introduced changes that resulted in a Linux-only installation.
wizardjoe
wizardjoe5mo ago
Trying "sudo docker build ." now Were you able to repro the problem on a Windows box? Just finished running - it fails as well If it's the case that this won't work on Windows, do you know if it would work on an Ubuntu VM on a windows host running with Hyper-V? Any updates on this? I also tried this on a Debian box I spun up in Google Cloud, but it also fails during the "Running setup.py develop for vllm" step. This time though, it just freezes completely before showing the word "Killed". The machine has an Nvidia L4 gpu with 24gb VRAM and 64 GB RAM FYI... For whoever reads this and is having the same issue, I finally got it working by doing the workaround @Herai_Studios suggested and making a few more changes: 1) You have to remove or comment out lines 25-27 in the Dockerfile, so that Docker doesn't try to compile vllm from source, 2) after line 32, add a new line "RUN pip install vllm", which will install the PyPI version of vllm, since we aren't compiling it anymore, and 3) when running the docker build command, specify WORKER_CUDA_VERSION = 12.1, since there is another issue with the latest version of vllm which won't work with CUDA 11.8.
Herai_Studios
Herai_Studios5mo ago
Nice! I'll add that if you have CUDA 11.8, the reason vllm won't work is that you need the right PyTorch version built for CUDA 11.8. So how does it look when it works for you?
wizardjoe
wizardjoe5mo ago
@Herai_Studios it spends some time downloading the model safetensors, and then after that, it exports the layers and then writes the image. I haven't tested the endpoint yet, will let you know more tomorrow
Concept
Concept5mo ago
Any updates on getting this working? I'm also struggling to use Mixtral 8x7B AWQ with the RunPod vLLM worker. I have 32GB of RAM, and my machine crashes at the part where it's running setup.py develop for vllm and my RAM usage just skyrockets.
Alpay Ariyak
Alpay Ariyak5mo ago
Unfortunately, Linux is an official requirement for vLLM, and WSL wouldn't work either (https://github.com/vllm-project/vllm/issues/1685). However, we believe we have a workaround that we're still actively testing.
Concept
Concept5mo ago
I’m using Linux and have an Nvidia GPU
Alpay Ariyak
Alpay Ariyak5mo ago
In that case, try adding this before the vLLM installation, and adjust max_jobs and nvcc_threads as needed:
# max jobs used by Ninja to build extensions
ARG max_jobs=2
ENV MAX_JOBS=${max_jobs}
# number of threads used by nvcc
ARG nvcc_threads=8
ENV NVCC_THREADS=$nvcc_threads
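Since max_jobs and nvcc_threads are declared as build args here, they can also be overridden at build time instead of editing the Dockerfile, e.g. (the tag is a placeholder):

sudo docker build \
  --build-arg max_jobs=2 \
  --build-arg nvcc_threads=8 \
  -t your-image:tag .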
Concept
Concept5mo ago
sudo docker build -t conceptgt/koglyticstream:v1.1 --build-arg MODEL_NAME="TheBloke/mixtral-8x7b-v0.1-AWQ" --build-arg MODEL_BASE_PATH="/models" --build-arg QUANTIZATION="awq" --build-arg WORKER_CUDA_VERSION="12.1" .
# syntax = docker/dockerfile:1.3
ARG WORKER_CUDA_VERSION=11.8
FROM runpod/base:0.4.4-cuda${WORKER_CUDA_VERSION}.0 as builder

ARG WORKER_CUDA_VERSION=11.8 # Required duplicate to keep in scope

# Set Environment Variables
ENV WORKER_CUDA_VERSION=${WORKER_CUDA_VERSION} \
HF_DATASETS_CACHE="/runpod-volume/huggingface-cache/datasets" \
HUGGINGFACE_HUB_CACHE="/runpod-volume/huggingface-cache/hub" \
TRANSFORMERS_CACHE="/runpod-volume/huggingface-cache/hub" \
HF_TRANSFER=1


# Install Python dependencies
COPY builder/requirements.txt /requirements.txt
RUN --mount=type=cache,target=/root/.cache/pip \
python3.11 -m pip install --upgrade pip && \
python3.11 -m pip install --upgrade -r /requirements.txt && \
rm /requirements.txt

# max jobs used by Ninja to build extensions
ARG max_jobs=2
ENV MAX_JOBS=${max_jobs}
# number of threads used by nvcc
ARG nvcc_threads=8
ENV NVCC_THREADS=$nvcc_threads

# Install torch and vllm based on CUDA version
RUN if [[ "${WORKER_CUDA_VERSION}" == 11.8* ]]; then \
python3.11 -m pip install -U --force-reinstall torch==2.1.2 xformers==0.0.23.post1 --index-url https://download.pytorch.org/whl/cu118; \
python3.11 -m pip install -e git+https://github.com/runpod/vllm-fork-for-sls-worker.git@cuda-11.8#egg=vllm; \
else \
python3.11 -m pip install -e git+https://github.com/runpod/vllm-fork-for-sls-worker.git#egg=vllm; \
fi && \
rm -rf /root/.cache/pip

# Add source files
COPY src .

# Setup for Option 2: Building the Image with the Model included
ARG MODEL_NAME=""
ARG MODEL_BASE_PATH="/runpod-volume/"
ARG QUANTIZATION=""

ENV MODEL_BASE_PATH=$MODEL_BASE_PATH \
MODEL_NAME=$MODEL_NAME \
QUANTIZATION=$QUANTIZATION

RUN --mount=type=secret,id=HF_TOKEN,required=false \
if [ -f /run/secrets/HF_TOKEN ]; then \
export HF_TOKEN=$(cat /run/secrets/HF_TOKEN); \
fi && \
if [ -n "$MODEL_NAME" ]; then \
python3.11 /download_model.py --model $MODEL_NAME; \
fi

# Start the handler
CMD ["python3.11", "/handler.py"]
Alpay Ariyak
Alpay Ariyak5mo ago
Yes, just like that, did it work?
Concept
Concept5mo ago
It's building now. Waiting
Alpay Ariyak
Alpay Ariyak5mo ago
Sounds good. It might take a while with max_jobs=2, so I'd maybe try starting with something like 75% of the default, which is the number of CPUs you have, and go down if you experience crashes.
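One way to express that "75% of your CPU count" starting point directly in the build command, assuming a bash shell and the max_jobs build arg from the snippet above (the tag is a placeholder):

sudo docker build \
  --build-arg max_jobs=$(( $(nproc) * 3 / 4 )) \
  --build-arg nvcc_threads=8 \
  -t your-image:tag .

Lower max_jobs further if RAM or swap fills up during the vllm compile.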
Concept
Concept5mo ago
Gotcha. It was the swap memory being full that caused my machine to crash, not the RAM per se. Number of cores, right? Yep, it definitely got past the erroring part. Downloading tensors now.
Alpay Ariyak
Alpay Ariyak5mo ago
Nice! Keep me posted
Concept
Concept5mo ago
Got it built and pushed. Loading it into an endpoint and seeing what happens.

CUDA OOM with a 24GB GPU. Trying with a 48GB GPU to see if that fixes it.

2024-01-19T18:58:40.556447675Z torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 256.00 MiB. GPU 0 has a total capacty of 23.68 GiB of which 39.62 MiB is free. Process 3426835 has 23.63 GiB memory in use. Of the allocated memory 23.22 GiB is allocated by PyTorch, and 33.74 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Concept
Concept5mo ago
@Alpay Ariyak For some reason it doesn't accept the chat template for conversation history.
Alpay Ariyak
Alpay Ariyak5mo ago
The Mixtral you're using is a base model, so it doesn't have a chat template. Mixtral Instruct would have one.
Concept
Concept5mo ago
Thank you.

2024-01-19T20:28:00.200082421Z INFO 01-19 20:28:00 llm_engine.py:70] Initializing an LLM engine with config: model='TheBloke/mixtral-8x7b-v0.1-AWQ', tokenizer='TheBloke/mixtral-8x7b-v0.1-AWQ', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=32768, download_dir='/models', load_format=auto, tensor_parallel_size=1, quantization=awq, enforce_eager=False, seed=0)

This log line is taking the most time. I'm stuck here for about 2-3 minutes.
Concept
Concept5mo ago
So the reason I'm trying to use Mixtral is its use of experts and also its context window. I'm open to using OpenChat; would it be possible to increase the context size from 8k, or is that fixed? @Justin
Alpay Ariyak
Alpay Ariyak5mo ago
Mixtral is fine, but you need the Instruct version of it if you want a chat template. Otherwise, you can pass your text input as the prompt.
Alpay Ariyak
Alpay Ariyak5mo ago
The mixtral you’re using is a completion model, not an instruction or chat model, so it doesn’t have a template
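For illustration, a minimal sketch of how a chat template is applied once an Instruct model is used; this uses the standard transformers API, with the Instruct repo swapped in for the base AWQ model from the logs above:

from transformers import AutoTokenizer

# The Instruct variant ships a chat template; the base/completion model does not
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B-Instruct-v0.1")

messages = [
    {"role": "user", "content": "Summarize the difference between a base model and an instruct model."},
]

# Render the conversation into the prompt format the model expects
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)

With a base model there is no template to apply, so the raw text has to be passed as a plain prompt instead.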
Concept
Concept5mo ago
Will look into it, thank you.
Alpay Ariyak
Alpay Ariyak5mo ago
That’s the model loading stage