Problem with RunPod cuda base image. Jobs stuck in queue forever

Hello, I'm trying to send a request to a serverless endpoint whose Dockerfile uses this base image: `FROM runpod/base:0.4.0-cuda11.8.0`. I want the server side to run the `input_fn` function when I make the request. This is part of the server-side code:

```python
model = model_fn('/app/src/tapnet/checkpoints/')
runpod.serverless.start({"handler": input_fn})
```

With the CUDA base image it never runs `input_fn`: I only see the debug prints from `model_fn`, and then the job stays in queue forever (photo). If I instead use `FROM python:3.11.1-buster`, both `model_fn` and `input_fn` run. So my questions are:
- Why does this happen with the CUDA base image?
- What are the implications of using the second base image? Are CUDA or PyTorch dependencies missing there?
- Which base image should I use? What do I do?
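For reference, a minimal self-contained sketch of the worker entrypoint described above (the `{"echo": ...}` body and the local dry-run guard are illustrative assumptions, not code from the thread):

```python
def input_fn(job):
    """RunPod serverless handler: receives a job dict, returns the result payload."""
    payload = job["input"]  # the request body sent to the endpoint
    # ... run inference with the preloaded model here ...
    return {"echo": payload}

try:
    import runpod  # provided by the `runpod` pip package on the worker

    # Load the model once at worker startup, outside the handler:
    # model = model_fn('/app/src/tapnet/checkpoints/')
    runpod.serverless.start({"handler": input_fn})
except ImportError:
    # Outside a RunPod worker (e.g. a local dry run) the SDK may be absent;
    # the handler can still be imported and unit-tested directly.
    pass
```

Until `runpod.serverless.start` is actually reached and registers the handler, the platform has nothing to dispatch jobs to, which is consistent with jobs sitting "IN QUEUE" forever.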
Solution:
Message Not Public
12 Replies
Unknown User · 2y ago
Message Not Public
galakurpismo3 (OP) · 2y ago
```dockerfile
FROM runpod/base:0.4.0-cuda11.8.0
# FROM python:3.11.1-buster

# Python dependencies
COPY builder/requirements.txt /requirements.txt
RUN python3.11 -m pip install --upgrade pip && \
    python3.11 -m pip install --upgrade -r /requirements.txt --no-cache-dir && \
    rm /requirements.txt

# Add src files (Worker Template)
COPY src /app/src

# Ensure the checkpoints directory exists and copy the checkpoint file
RUN mkdir -p /app/src/tapnet/checkpoints
COPY src/tapnet/checkpoints/bootstapir_checkpoint.pt /app/src/tapnet/checkpoints/bootstapir_checkpoint.pt

# Set working directory
WORKDIR /app

# Set AWS credentials. DEBUG, move these to env later
ENV AWS_ACCESS_KEY_ID=...
ENV AWS_SECRET_ACCESS_KEY=...
ENV AW...

ENV PYTHONPATH=/app

CMD ["python3.11", "-u", "src/inference.py"]
```

If I use the CUDA image it is not running; if I use the other image, it runs, it gets the video and everything. Sorry for the bad format on the Dockerfile, but it's just the typical thing, I guess.
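One avenue worth checking (an editorial note, not advice from the thread): if the plain `python:3.11.1-buster` image runs the handler but lacks GPU support, CUDA-enabled PyTorch wheels can be pinned explicitly while keeping whichever base is chosen. A sketch, with the torch install line being an assumption rather than the repo's actual `requirements.txt` contents:

```dockerfile
FROM runpod/base:0.4.0-cuda11.8.0

COPY builder/requirements.txt /requirements.txt
RUN python3.11 -m pip install --upgrade pip && \
    # Illustrative: install torch built against CUDA 11.8 to match the base image
    python3.11 -m pip install torch --index-url https://download.pytorch.org/whl/cu118 && \
    python3.11 -m pip install --upgrade -r /requirements.txt --no-cache-dir && \
    rm /requirements.txt
```

Mismatched CUDA versions between the base image and the installed torch wheels are a common source of workers that import fine but hang or fail at inference time.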
Unknown User · 2y ago
Message Not Public
Madiator2011 · 2y ago
Would need to see the error message.
galakurpismo3 (OP) · 2y ago
There are no errors really, it's just that `input_fn` isn't running. Where can I find a link or something for that?
Unknown User · 2y ago
Message Not Public
galakurpismo3 (OP) · 2y ago
Okay, I'll try doing that. I guess that using `python:3.11.1-buster` won't work, right?
Unknown User · 2y ago
Message Not Public
galakurpismo3 (OP) · 2y ago
It works with that one, meaning it gets inside `input_fn`, but there are going to be missing dependencies or something needed to run on the GPU.
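A quick way to confirm that suspicion (a sketch; assumes PyTorch is listed in `requirements.txt`) is to log whether the container actually sees a CUDA device before blaming the handler:

```python
def cuda_status():
    """Report whether PyTorch and a visible CUDA device are available."""
    try:
        import torch  # may be absent if requirements.txt doesn't include it
    except ImportError:
        return "torch not installed"
    if torch.cuda.is_available():
        # A GPU is visible; report its name for the logs
        return "cuda ok: " + torch.cuda.get_device_name(0)
    return "torch installed, but no CUDA device visible"

print(cuda_status())
```

On the plain `python:3.11.1-buster` image this would be expected to print either "torch not installed" or "no CUDA device visible", since that image ships no CUDA runtime.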
Unknown User · 2y ago
Message Not Public
galakurpismo3 (OP) · 2y ago
Yeah, okay, I'll try both things. Thank you so much!
Unknown User · 2y ago
Message Not Public
