Problem with RunPod CUDA base image. Jobs stuck in queue forever
Hello, I'm trying to send a request to a serverless endpoint whose Dockerfile uses this base image:
FROM runpod/base:0.4.0-cuda11.8.0
I want the server side to run the input_fn function when I make the request. This is part of the server-side code:
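The exact snippet is not visible in the thread, so below is a minimal sketch of what a RunPod serverless worker with these function names typically looks like; the handler wiring, input parsing, and return value are assumptions, not the poster's actual code.

# Hypothetical sketch of the worker (src/inference.py), not the poster's real code.
import runpod

def model_fn():
    # Runs once at worker start-up; these are the debug prints that do appear.
    print("model_fn: loading model")
    return object()  # placeholder for the real model

def input_fn(event):
    # Should run once per request; this is the function that never executes
    # when the job stays queued on the CUDA base image.
    print("input_fn: got", event.get("input"))
    return event.get("input")

MODEL = model_fn()

def handler(event):
    data = input_fn(event)
    # run inference with MODEL on `data` here
    return {"output": data}

# Register the handler with the RunPod serverless runtime.
runpod.serverless.start({"handler": handler})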
If I use the CUDA base image it does not run input_fn; I only see the debug prints from model_fn, and then the job stays in the queue forever (see attached screenshot).
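For reference, the request itself is just the standard serverless endpoint call; this is a hypothetical example with placeholder endpoint ID, API key, and input payload, shown only to illustrate where the IN_QUEUE status comes from.

# Hypothetical client call; endpoint ID, API key, and payload are placeholders.
import requests

API_KEY = "YOUR_RUNPOD_API_KEY"
ENDPOINT_ID = "YOUR_ENDPOINT_ID"
BASE = f"https://api.runpod.ai/v2/{ENDPOINT_ID}"
HEADERS = {"Authorization": f"Bearer {API_KEY}"}

# Submit the job.
job = requests.post(
    f"{BASE}/run",
    headers=HEADERS,
    json={"input": {"video_url": "https://example.com/video.mp4"}},
).json()

# Poll the job; with the CUDA base image the status never leaves IN_QUEUE.
status = requests.get(f"{BASE}/status/{job['id']}", headers=HEADERS).json()
print(status["status"])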
The thing is that if I use this base image:
FROM python:3.11.1-buster
It does run both input_fn and model_fn
So my questions are:
- Why is this problem happening with the CUDA base image?
- What are the implications of using the second base image? Are there CUDA or PyTorch dependencies missing there? (A quick check is sketched after this list.)
- What base image should I use? What do I do?
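Regarding the second question: a quick way to check from inside the worker whether the GPU is actually usable, and therefore whether the plain python:3.11.1-buster image is missing anything, is a torch CUDA probe. This assumes torch is already listed in requirements.txt.

# Quick GPU sanity check (assumes torch is installed via requirements.txt).
import torch

print("torch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))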
12 Replies
Unknown User•2y ago
Message Not Public
# Base image (I switch between these two while testing)
FROM runpod/base:0.4.0-cuda11.8.0
# FROM python:3.11.1-buster

# Python dependencies
COPY builder/requirements.txt /requirements.txt
RUN python3.11 -m pip install --upgrade pip && \
    python3.11 -m pip install --upgrade -r /requirements.txt --no-cache-dir && \
    rm /requirements.txt

# Add src files (Worker Template)
COPY src /app/src

# Ensure the checkpoints directory exists and copy the checkpoint file
RUN mkdir -p /app/src/tapnet/checkpoints
COPY src/tapnet/checkpoints/bootstapir_checkpoint.pt /app/src/tapnet/checkpoints/bootstapir_checkpoint.pt

# Set working directory
WORKDIR /app

# Set AWS credentials. DEBUG, later move these to env vars or ...
ENV AWS_ACCESS_KEY_ID=...
ENV AWS_SECRET_ACCESS_KEY=...
ENV AW...
ENV PYTHONPATH=/app

CMD ["python3.11", "-u", "src/inference.py"]
If I use the CUDA image it does not run; if I use the other image, it runs, it gets the video and everything works.
Sorry for the bad formatting of the Dockerfile, but it's just the typical setup, I guess.
Unknown User•2y ago
Message Not Public
Would need to see the error message.
There are no errors really; it's just that input_fn isn't running.
Where can I find a link or something to that?
Unknown User•2y ago
Message Not Public
Okay, I'll try doing that. I guess that using python:3.11.1-buster won't work, right?
Unknown User•2y ago
Message Not Public
It works with that one, meaning that it gets inside input_fn, but there are probably dependencies missing to actually run on the GPU.
Unknown User•2y ago
Message Not Public
Yeah, okay, I'll try both things. Thank you so much!
Unknown User•2y ago
Message Not Public