Error building worker-vllm Docker image for Mixtral 8x7B

I'm running the following command to build and tag a Docker worker image based off of worker-vllm:

docker build -t lesterhnh/mixtral-8x7b-instruct-v0.1-runpod-serverless:1.0 --build-arg MODEL_NAME="mistralai/Mixtral-8x7B-Instruct-v0.1" --build-arg MODEL_BASE_PATH="/models" .

I'm getting the following error:

------
Dockerfile:23
--------------------
22 | # Install torch and vllm based on CUDA version
23 | >>> RUN if [[ "${WORKER_CUDA_VERSION}" == 11.8* ]]; then \
24 | >>> python3.11 -m pip install -U --force-reinstall torch==2.1.2 xformers==0.0.23.post1 --index-url https://download.pytorch.org/whl/cu118; \
25 | >>> python3.11 -m pip install -e git+https://github.com/runpod/vllm-fork-for-sls-worker.git@cuda-11.8#egg=vllm; \
26 | >>> else \
27 | >>> python3.11 -m pip install -e git+https://github.com/runpod/vllm-fork-for-sls-worker.git#egg=vllm; \
28 | >>> fi && \
29 | >>> rm -rf /root/.cache/pip
30 |
--------------------
ERROR: failed to solve: process "/bin/bash -o pipefail -c if [[ "${WORKER_CUDA_VERSION}" == 11.8* ]]; then python3.11 -m pip install -U --force-reinstall torch==2.1.2 xformers==0.0.23.post1 --index-url https://download.pytorch.org/whl/cu118; python3.11 -m pip install -e git+https://github.com/runpod/vllm-fork-for-sls-worker.git@cuda-11.8#egg=vllm; else python3.11 -m pip install -e git+https://github.com/runpod/vllm-fork-for-sls-worker.git#egg=vllm; fi && rm -rf /root/.cache/pip" did not complete successfully: exit code: 1
Justin Merrell
Justin Merrell5mo ago
Are you building this on a system that has a GPU? cc: @Alpay Ariyak
wizardjoe
wizardjoe5mo ago
Yes - building it on a Windows PC with a 4090. I'm running the command on WSL (Windows Subsystem for Linux), if that helps. Just tried it on the regular command prompt, and I can confirm that I'm getting the same error.
vaventt
vaventt5mo ago
Hi Justin, can you help me with a few questions? I need to develop and deploy a RAG system based on an open-source LLM. I have tried several times on RunPod Serverless (A6000/A100): it starts the worker and container, then downloads 50GB or 150GB of weights, but never all 270GB. It just stops and restarts the download again and again, burning money with no real outcome. I just can't deploy LLaMA-70B; RunPod doesn't give me a chance. What should I do? Is the Cloud GPU option more suitable and stable for production than Serverless?
Justin Merrell
Justin Merrell5mo ago
If you are downloading weights before the handler starts, then your worker is timing out and being removed. Ideally you will want to either have the weights stored in network storage or have them baked into the worker image.
vaventt
vaventt5mo ago
I have only one rp_handler.py file where all the code is located. It starts after the last command in the Docker container, and once the handler function is triggered, my Hugging Face weights start downloading on line 35:

AutoModelForCausalLM.from_pretrained(
    model_path,
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.float16,
    quantization_config=quantization_config
)

Sometimes it works on the first try, but specifically with the big LLaMA, whose weights take up to an hour to download in my case, the container stops without throwing an error while there is still 1 job in the queue. The small Mistral-7B usually works great, but when I take a bigger model it just doesn't work. So network storage for my workers would definitely help and is good deployment practice, in terms of using RunPod?
Justin Merrell
Justin Merrell5mo ago
Yes, network storage sounds like what you are missing
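For reference, a minimal sketch of what that could look like inside a custom handler, assuming the endpoint has a network volume mounted at /runpod-volume; the cache path and model ID are illustrative, not the exact code from this thread:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed network-volume mount; weights cached here survive worker restarts
CACHE_DIR = "/runpod-volume/huggingface-cache/hub"
MODEL_ID = "mistralai/Mistral-7B-Instruct-v0.1"  # placeholder model

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, cache_dir=CACHE_DIR)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    cache_dir=CACHE_DIR,  # download once, reuse on later cold starts
    device_map="auto",
    torch_dtype=torch.float16,
)

With the cache on the volume, weights downloaded by one worker are reused by the next instead of being re-downloaded on every cold start.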
wizardjoe
wizardjoe5mo ago
@Justin, any idea on the error I'm getting?
vaventt
vaventt5mo ago
Great, thanks a lot. By the way, I'm located in Eastern Europe; how do I choose the best region for my network storage? By distance, EU-RO-1 and EU-CZ-1 should be the closest, but maybe some regions have more GPUs in general to choose from and work with?
Alpay Ariyak
Alpay Ariyak5mo ago
Will check on this in a few hours
wizardjoe
wizardjoe5mo ago
Some more detail - it looks like it fails when trying to run setup.py develop for vllm. Ninja is trying to compile and fails.
wizardjoe
wizardjoe5mo ago
Also confirming that I'm getting the same error when trying to build Llama2-13b. I was able to build and deploy Llama2-13b a month ago, so something must have changed since then. I also noticed that GitHub is showing the builds failing on CD.
Herai_Studios
Herai_Studios5mo ago
@Alpay Ariyak @Justin If you guys have any update on this, I would be interested in knowing the outcome as I am facing this issue as well
wizardjoe
wizardjoe5mo ago
For what it's worth, I can't even get it to work using the pre-built Docker image with environment variables. When I use this method to spin up an endpoint, I'm getting CUDA out of memory errors, even though I selected a 48GB GPU
Herai_Studios
Herai_Studios5mo ago
Just to be clear - it doesn't let you get past the setup.py script for vllm, correct? This is where it breaks for me with the pre-built Dockerfile as well. I built my own where I just added vllm to the requirements.txt file, and that worked better. Or you can do RUN pip install vllm.
wizardjoe
wizardjoe5mo ago
Yup, that's where it breaks for me too. Did you have to specify a specific version for vllm?
ashleyk
ashleyk5mo ago
Don't think you can build it on GitHub, because I believe it now requires the machine you're building on to have a GPU.
Herai_Studios
Herai_Studios5mo ago
@ashleyk if I'm not mistaken, @wizardjoe mentioned his machine has a GPU and mine also does. The docker image is not working correctly for either of us though and it's breaking at the same point
Alpay Ariyak
Alpay Ariyak5mo ago
Working on this
wizardjoe
wizardjoe5mo ago
@Herai_Studios @ashleyk Yes, I have a 4090
Alpay Ariyak
Alpay Ariyak5mo ago
To confirm, you’re also not able to just sudo docker build . ?
wizardjoe
wizardjoe5mo ago
@Alpay Ariyak you mean, "sudo docker build ." without tagging and any of the other args?
Alpay Ariyak
Alpay Ariyak5mo ago
Yeah, just without args. It seems to me that the general issue is not being on something like Ubuntu, because I've never had issues building it from different Ubuntu machines. We're working on a way to allow building with any OS; vLLM's recent updates introduced changes that resulted in a Linux-only installation.
wizardjoe
wizardjoe5mo ago
Trying "sudo docker build ." now Were you able to repro the problem on a Windows box? Just finished running - it fails as well If it's the case that this won't work on Windows, do you know if it would work on an Ubuntu VM on a windows host running with Hyper-V? Any updates on this? I also tried this on a Debian box I spun up in Google Cloud, but it also fails during the "Running setup.py develop for vllm" step. This time though, it just freezes completely before showing the word "Killed". The machine has an Nvidia L4 gpu with 24gb VRAM and 64 GB RAM FYI... For whoever reads this and is having the same issue, I finally got it working by doing the workaround @Herai_Studios suggested and making a few more changes: 1) You have to remove or comment out lines 25-27 in the Dockerfile, so that Docker doesn't try to compile vllm from source, 2) after line 32, add a new line "RUN pip install vllm", which will install the PyPI version of vllm, since we aren't compiling it anymore, and 3) when running the docker build command, specify WORKER_CUDA_VERSION = 12.1, since there is another issue with the latest version of vllm which won't work with CUDA 11.8.
Herai_Studios
Herai_Studios5mo ago
Nice! I'll add that if you have CUDA 11.8, the reason vllm won't work is that you need the right PyTorch version built for CUDA 11.8. So how does it look when it works for you?
wizardjoe
wizardjoe5mo ago
@Herai_Studios it spends some time downloading the model safetensors, and then after that, it exports the layers and then writes the image. I haven't tested the endpoint yet, will let you know more tomorrow
Concept
Concept5mo ago
Any updates on getting this working? I'm also struggling to use Mixtral 8x7B AWQ with the RunPod vLLM worker. I have 32GB of RAM, and my machine crashes at the part where it's running setup.py develop for vllm and my RAM usage just skyrockets.
Alpay Ariyak
Alpay Ariyak5mo ago
Unfortunately, Linux is an official requirement for vLLM, and WSL wouldn't work either (https://github.com/vllm-project/vllm/issues/1685). However, we believe we have a workaround that we're still actively testing.
Concept
Concept5mo ago
I’m using Linux and have an Nvidia GPU
Alpay Ariyak
Alpay Ariyak5mo ago
In that case, try adding this before the vLLM installation, and adjust max_jobs and nvcc_threads as needed:
# max jobs used by Ninja to build extensions
ARG max_jobs=2
ENV MAX_JOBS=${max_jobs}
# number of threads used by nvcc
ARG nvcc_threads=8
ENV NVCC_THREADS=$nvcc_threads
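Since max_jobs and nvcc_threads are declared as build args here, they can also be overridden at build time instead of editing the Dockerfile, e.g. (the tag is a placeholder):

sudo docker build \
  --build-arg max_jobs=2 \
  --build-arg nvcc_threads=8 \
  -t your-image:tag .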
Concept
Concept5mo ago
sudo docker build -t conceptgt/koglyticstream:v1.1 --build-arg MODEL_NAME="TheBloke/mixtral-8x7b-v0.1-AWQ" --build-arg MODEL_BASE_PATH="/models" --build-arg QUANTIZATION="awq" --build-arg WORKER_CUDA_VERSION="12.1" .
# syntax = docker/dockerfile:1.3
ARG WORKER_CUDA_VERSION=11.8
FROM runpod/base:0.4.4-cuda${WORKER_CUDA_VERSION}.0 as builder

ARG WORKER_CUDA_VERSION=11.8 # Required duplicate to keep in scope

# Set Environment Variables
ENV WORKER_CUDA_VERSION=${WORKER_CUDA_VERSION} \
HF_DATASETS_CACHE="/runpod-volume/huggingface-cache/datasets" \
HUGGINGFACE_HUB_CACHE="/runpod-volume/huggingface-cache/hub" \
TRANSFORMERS_CACHE="/runpod-volume/huggingface-cache/hub" \
HF_TRANSFER=1


# Install Python dependencies
COPY builder/requirements.txt /requirements.txt
RUN --mount=type=cache,target=/root/.cache/pip \
python3.11 -m pip install --upgrade pip && \
python3.11 -m pip install --upgrade -r /requirements.txt && \
rm /requirements.txt

# max jobs used by Ninja to build extensions
ARG max_jobs=2
ENV MAX_JOBS=${max_jobs}
# number of threads used by nvcc
ARG nvcc_threads=8
ENV NVCC_THREADS=$nvcc_threads

# Install torch and vllm based on CUDA version
RUN if [[ "${WORKER_CUDA_VERSION}" == 11.8* ]]; then \
python3.11 -m pip install -U --force-reinstall torch==2.1.2 xformers==0.0.23.post1 --index-url https://download.pytorch.org/whl/cu118; \
python3.11 -m pip install -e git+https://github.com/runpod/vllm-fork-for-sls-worker.git@cuda-11.8#egg=vllm; \
else \
python3.11 -m pip install -e git+https://github.com/runpod/vllm-fork-for-sls-worker.git#egg=vllm; \
fi && \
rm -rf /root/.cache/pip

# Add source files
COPY src .

# Setup for Option 2: Building the Image with the Model included
ARG MODEL_NAME=""
ARG MODEL_BASE_PATH="/runpod-volume/"
ARG QUANTIZATION=""

ENV MODEL_BASE_PATH=$MODEL_BASE_PATH \
MODEL_NAME=$MODEL_NAME \
QUANTIZATION=$QUANTIZATION

RUN --mount=type=secret,id=HF_TOKEN,required=false \
if [ -f /run/secrets/HF_TOKEN ]; then \
export HF_TOKEN=$(cat /run/secrets/HF_TOKEN); \
fi && \
if [ -n "$MODEL_NAME" ]; then \
python3.11 /download_model.py --model $MODEL_NAME; \
fi

# Start the handler
CMD ["python3.11", "/handler.py"]
Alpay Ariyak
Alpay Ariyak5mo ago
Yes, just like that, did it work?
Concept
Concept5mo ago
It's building now. Waiting
Alpay Ariyak
Alpay Ariyak5mo ago
Sounds good. It might take a while with max_jobs=2, so I'd maybe try starting with something like 75% of the default, which is the number of CPUs you have, and go down if you experience crashes.
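One way to express that "75% of your CPU count" starting point directly in the build command, assuming a bash shell and the max_jobs build arg from the snippet above (the tag is a placeholder):

sudo docker build \
  --build-arg max_jobs=$(( $(nproc) * 3 / 4 )) \
  --build-arg nvcc_threads=8 \
  -t your-image:tag .

Lower max_jobs further if RAM or swap fills up during the vllm compile.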
Concept
Concept5mo ago
Gotcha. It was the swap memory being full that caused my machine to crash, not the RAM per se. Number of cores, right? Yep, it definitely got past the erroring part. Downloading tensors now.
Alpay Ariyak
Alpay Ariyak5mo ago
Nice! Keep me posted
Concept
Concept5mo ago
Got it built and pushed. Loading it into an endpoint and seeing what happens.

CUDA OOM with a 24GB GPU. Trying with a 48GB GPU to see if that fixes it.

2024-01-19T18:58:40.556447675Z torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 256.00 MiB. GPU 0 has a total capacty of 23.68 GiB of which 39.62 MiB is free. Process 3426835 has 23.63 GiB memory in use. Of the allocated memory 23.22 GiB is allocated by PyTorch, and 33.74 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Concept
Concept5mo ago
@Alpay Ariyak For some reason it doesn't accept the chat template for conversation history.
Alpay Ariyak
Alpay Ariyak5mo ago
The Mixtral you're using is a base model, so it doesn't have a chat template. Mixtral Instruct would have one.
Concept
Concept5mo ago
Thank you.

2024-01-19T20:28:00.200082421Z INFO 01-19 20:28:00 llm_engine.py:70] Initializing an LLM engine with config: model='TheBloke/mixtral-8x7b-v0.1-AWQ', tokenizer='TheBloke/mixtral-8x7b-v0.1-AWQ', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=32768, download_dir='/models', load_format=auto, tensor_parallel_size=1, quantization=awq, enforce_eager=False, seed=0)

This log line is taking the most time. I'm stuck here for about 2-3 minutes.
Concept
Concept5mo ago
So the reason I'm trying to use Mixtral is its use of experts and also its context window. I'm open to using OpenChat; would it be possible to increase the context size from 8k, or is that fixed? @Justin
Alpay Ariyak
Alpay Ariyak5mo ago
Mixtral is fine, but you need the Instruct version of it if you want a chat template. Otherwise, you can pass your text input as the prompt.
Alpay Ariyak
Alpay Ariyak5mo ago
The mixtral you’re using is a completion model, not an instruction or chat model, so it doesn’t have a template
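For illustration, a minimal sketch of how a chat template is applied once an Instruct model is used; this uses the standard transformers API, with the Instruct repo swapped in for the base AWQ model from the logs above:

from transformers import AutoTokenizer

# The Instruct variant ships a chat template; the base/completion model does not
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B-Instruct-v0.1")

messages = [
    {"role": "user", "content": "Summarize the difference between a base model and an instruct model."},
]

# Render the conversation into the prompt format the model expects
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)

With a base model there is no template to apply, so the raw text has to be passed as a plain prompt instead.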
Concept
Concept5mo ago
Will look into it, thank you.
Alpay Ariyak
Alpay Ariyak5mo ago
That’s the model loading stage