RunPod•6mo ago
blistick

Slow model loading

Hi all. I have a serverless endpoint designed to run Stable Diffusion inference. It's taking about 12 seconds to load the model (Realistic Vision) into the pipeline (using "StableDiffusionPipeline.from_pretrained") from a RunPod network drive. Is this normal? Is the load time mostly a function of (possibly slow) communications speed between the serverless instance and the network volume? The problem is that I'm loading other models as well, so even if I keep the endpoint active there is still a big delay before inference for a job can even begin, and then of course there's the time for inference itself. The total time is too long to provide a good customer experience. I love the idea of easy scaling using the serverless approach, and cost control, but if I can't improve the speed I may have to use a different approach. Any input on other people's experience and ways to improve model loading time would be greatly appreciated!
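For reference, the slow step is roughly this (the paths here are illustrative, not my exact setup):
import torch
from diffusers import StableDiffusionPipeline

# Model folder living on the attached network volume (illustrative path)
MODEL_DIR = "/runpod-volume/models/realistic-vision"

# This is the ~12 second step
pipe = StableDiffusionPipeline.from_pretrained(MODEL_DIR, torch_dtype=torch.float16)
pipe = pipe.to("cuda")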
7 Replies
justin
justin•6mo ago
@blistick You've hit a VERY common roadblock, actually, and it's why I don't prefer the network volume. There is an initial hit to load the model from the volume, since there is a network delay. If your models are not too big, I prefer to pack them into the Docker container itself rather than using network storage, though this really depends on how big the models are, since it could get inefficient. Depending on how much volume you're getting through your endpoint, some additional options I can think of:
1) Keep your models in memory. I'm not sure how your code runs/executes, but you could potentially do a loadModel() call into a variable and move it outside of your handler scope, so the model stays in memory rather than being thrown away on every handler call. Combine this with increasing your idle timeout to 5 minutes, or however long you feel is good. That way only the first request eats the model-loading time, and all subsequent requests to the same worker have the model in memory.
2) As I said, bake the model into your Docker container instead, depending on the size.
3) Compress the models on the network volume, see how long it takes to copy a model onto the worker and unzip it there, and then point all subsequent calls at that local location (see the sketch after this message).
4) One way to test all of this is to get a GPU pod with a network volume in the same region and just play around. I like to copy my handler function, throw it into a Jupyter notebook, and mess with it there. The one thing you can't test as well is the module-scope variable; that's just my guess at a potential optimization.
5) As a nice little practice, I like to split my Dockerfiles into multiple files now:
a) One Dockerfile has all of the models and base stuff downloaded, so usually: FROM a RunPod base template, then install/download the models.
b) A second Dockerfile copies handler.py into /directory/whatever,
c) symlinks directories, sets whatever ENV variables, and so on.
That way, if your models are static and you have to rebuild your Docker images, you don't need to go through the WHOLE download-the-models process again lol
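For 3), a rough sketch of what I mean (the paths, archive name, and helper are made up for illustration, and it assumes the model files sit at the top level of the archive):
import shutil
import tarfile
from pathlib import Path

# Compressed model sitting on the (slower) network volume
NETWORK_ARCHIVE = Path("/runpod-volume/models/realistic-vision.tar.gz")
# Worker-local disk, which is much faster to read from
LOCAL_DIR = Path("/tmp/models/realistic-vision")

def ensure_local_model() -> Path:
    """Copy + extract the archive once per worker; later calls reuse the local copy."""
    if not LOCAL_DIR.exists():
        LOCAL_DIR.parent.mkdir(parents=True, exist_ok=True)
        local_archive = Path("/tmp") / NETWORK_ARCHIVE.name
        shutil.copy2(NETWORK_ARCHIVE, local_archive)  # one big sequential read over the network
        with tarfile.open(local_archive) as tar:
            tar.extractall(LOCAL_DIR)                 # unpack onto fast local storage
        local_archive.unlink()
    return LOCAL_DIR

# Then load from the local path instead of the network volume:
# pipe = StableDiffusionPipeline.from_pretrained(ensure_local_model())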
blistick
blistick•6mo ago
@justin Thanks very much for this useful information. Lots to try! I'm confused about your suggestion to keep the models in memory. I already load them outside of my handler function, but even for an "active" worker, that code isn't executed until a request comes in, at least from what I can tell. Maybe active workers only execute that code once, so the models do stay in memory after the first request. I'll do more tests to see. Of course, for cost reasons I'd rather not need to keep an active worker.
justin
justin•6mo ago
I see, hm. Yeah, for how the worker behaves, this is just my theory:
model = loadModel(...)  # pseudocode: load once at module scope, when the file is imported

def handler(event):
    return model.infer(event["input"])  # reuse the already-loaded model on every call
The first time this worker gets executed, it will load the model into memory via that module-level variable. The second time, if the worker hasn't shut down yet and is still in the idle state, the model is "PROBABLY" still in memory; I have zero clue, I don't know RunPod's architecture. My guess is that when a request comes in they call your handler, but since handler.py is already imported somewhere, the module scope should stay in memory.
https://chat.openai.com/share/77cef5b4-c6ae-45e7-a28c-dc041416008e
Yup, makes sense. Tbh, the method I use tends to be just baking the models into the Dockerfile. I have an audio-generation endpoint that I was working on splitting yesterday and testing on a GPU pod, and I split it into a base Dockerfile and a serverless Dockerfile. In case you are interested, here's the GPU Pod version:
# Use the RUNPOD base Pytorch
FROM runpod/pytorch:2.1.1-py3.10-cuda12.1.1-devel-ubuntu22.04

WORKDIR /app

# Best practices for minimizing layer size and avoiding cache issues
RUN apt-get update && \
    apt-get install -y --no-install-recommends ffmpeg && \
    rm -rf /var/lib/apt/lists/* && \
    pip install --no-cache-dir torch==2.1.2 torchvision torchaudio xformers audiocraft firebase-rest-api==1.11.0 noisereduce==3.0.0 runpod

COPY preloadModel.py /app/preloadModel.py
COPY handler.py /app/handler.py
COPY firebase_credentials.json /app/firebase_credentials.json
COPY suprepo /app/suprepo

RUN python /app/preloadModel.py
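preloadModel.py just loads/downloads the models once at build time so the weights get baked into an image layer and cached by Docker. For your Stable Diffusion case the equivalent would be roughly this (a sketch; the model ID is a placeholder for whatever you actually use):
# preloadModel.py (sketch) -- run during docker build, so the download happens once
# and the weights live inside the image instead of on a network volume
import torch
from diffusers import StableDiffusionPipeline

# Downloads into the Hugging Face cache inside the image; at runtime the same
# from_pretrained() call becomes a fast local read. No .to("cuda") here, since
# there's no GPU available at build time.
StableDiffusionPipeline.from_pretrained(
    "SG161222/Realistic_Vision_V5.1_noVAE",  # placeholder: your actual model ID or local path
    torch_dtype=torch.float16,
)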
Serverless version:
# Use the updated base CUDA image
FROM justinwlin/DOCKERFILENAME:1.0

WORKDIR /app
COPY handler.py /app/handler.py
# Set Stop signal and CMD
STOPSIGNAL SIGINT
CMD ["python", "-u", "handler.py"]
Also, a nice side bonus of splitting the files is that you get a GPU Pod version to test on and a serverless version, haha, since the RunPod base templates already have everything installed for GPU pods.
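And the handler.py that the CMD runs is shaped roughly like this (a sketch using diffusers since that's your case; the model path and output handling are illustrative):
# handler.py (sketch) -- the model loads at import time, so a warm worker keeps it around
import runpod
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "/app/models/realistic-vision",  # illustrative: baked into the image by the base Dockerfile
    torch_dtype=torch.float16,
).to("cuda")

def handler(event):
    prompt = event["input"]["prompt"]
    image = pipe(prompt).images[0]
    path = "/tmp/out.png"
    image.save(path)
    # In practice you'd upload the image somewhere or return it base64-encoded
    return {"image_path": path}

runpod.serverless.start({"handler": handler})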
blistick
blistick•6mo ago
@justin This all makes sense. I've got some work to do, but at least there's a path forward to better performance. Thanks again! (And Happy Holidays!)
justin
justin•6mo ago
Happy Holidays! Also, I have ZERO clue btw what any of this means, xD, but I know SECourses / Kopyl have played with the TensorRT stuff. https://discord.com/channels/912829806415085598/1185336794309468210/1185639798745088081 Even though it seems to be for the webUI, maybe TensorRT, whatever the heck it is, can also help speed up inference time in general. https://discord.com/channels/912829806415085598/1185336794309468210/1185763104039116890
blistick
blistick•6mo ago
Haha. Well, if you have ZERO clue that must mean I have NEGATIVE clue. I'll check out those resources.
foxhound
foxhound•5mo ago
The main performance bottleneck doesn't stem from moving the models outside the handler or from loading them off a network volume. The real issue is the initial loading of the models into VRAM (not RAM or disk) before input preprocessing. I have tried to mitigate the problem by disabling VRAM offloading, but if the worker spins down, that triggers a complete reinitialization. 😦
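Roughly what I mean by disabling offloading (a diffusers-flavored sketch; it only helps while the worker stays warm, and the model ID is a placeholder):
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "SG161222/Realistic_Vision_V5.1_noVAE",  # placeholder model ID
    torch_dtype=torch.float16,
)

# Keep everything resident in VRAM: fastest per request, highest VRAM use.
pipe.to("cuda")

# The offloading alternative (lower VRAM, but weights shuttle CPU<->GPU on each call,
# and a cold-started worker still has to reload everything from scratch):
# pipe.enable_model_cpu_offload()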