Problem with RunPod CUDA base image: jobs stuck in queue forever

Hello, I'm trying to send a request to a serverless endpoint whose Dockerfile uses this base image:
FROM runpod/base:0.4.0-cuda11.8.0
I want the server side to run the input_fn function when I send the request. This is part of the server-side code:
model = model_fn('/app/src/tapnet/checkpoints/')
runpod.serverless.start({"handler": input_fn})
If I use the CUDA base image, input_fn never runs: I only see the debug prints from model_fn, and then the job stays in the queue forever (photo). However, if I use this base image instead:
FROM python:3.11.1-buster
both model_fn and input_fn run. So my questions are:
- Why does the problem happen with the CUDA base image?
- What are the implications of using the second base image? Are CUDA or PyTorch dependencies missing there?
- What base image should I use? What do I do?
Solution:
Hmm yeah, I guess Python 3.11 is missing from that RunPod base image.
13 Replies
nerdylive (2mo ago)
I have no problem with the CUDA base image. What does your Dockerfile look like? How do you run the Python script? Is input_fn called when it's running?
galakurpismo3 (2mo ago)
FROM runpod/base:0.4.0-cuda11.8.0
FROM python:3.11.1-buster

# Python dependencies
COPY builder/requirements.txt /requirements.txt
RUN python3.11 -m pip install --upgrade pip && \
    python3.11 -m pip install --upgrade -r /requirements.txt --no-cache-dir && \
    rm /requirements.txt

# Add src files (Worker Template)
COPY src /app/src

# Ensure the checkpoints directory exists and copy the checkpoint file
RUN mkdir -p /app/src/tapnet/checkpoints
COPY src/tapnet/checkpoints/bootstapir_checkpoint.pt /app/src/tapnet/checkpoints/bootstapir_checkpoint.pt

# Set working directory
WORKDIR /app

# Set AWS credentials. DEBUG, later put these in env or
ENV AWS_ACCESS_KEY_ID=...
ENV AWS_SECRET_ACCESS_KEY=...
ENV AW...

ENV PYTHONPATH=/app

CMD ["python3.11", "-u", "src/inference.py"]

If I use the CUDA image it doesn't run; if I use the other image, it runs, gets the video and everything.
Sorry for the bad formatting of the Dockerfile, but it's just the typical thing, I guess.
nerdylive (2mo ago)
Hmm, try the CUDA image from NGC.
Madiator2011 (2mo ago)
I would need to see the error message.
galakurpismo3 (2mo ago)
There are no errors really, it's just that input_fn isn't running.
Where can I find a link or something to that?
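Since there is no error message to share, one way to see whether the handler is ever reached is to wrap it in a small tracing decorator. The following is only a sketch: `traced` is a hypothetical helper, the body of `input_fn` is a stand-in for the real one, and the fake job assumes the usual shape with `id` and `input` keys.

```python
import functools
import traceback


def traced(handler):
    """Wrap a handler so every call logs entry, exit, and any exception."""

    @functools.wraps(handler)
    def wrapper(job):
        # flush=True so the line reaches the worker logs immediately
        print(f"[handler] start id={job.get('id')}", flush=True)
        try:
            result = handler(job)
        except Exception:
            print("[handler] failed:\n" + traceback.format_exc(), flush=True)
            raise
        print("[handler] done", flush=True)
        return result

    return wrapper


# Stand-in for the real input_fn from this thread (hypothetical body).
def input_fn(job):
    return {"echo": job["input"]}


if __name__ == "__main__":
    # Local smoke test with a fake job, no RunPod worker involved.
    print(traced(input_fn)({"id": "local-test", "input": {"video": "demo.mp4"}}))
    # In the worker itself, the wrapped handler would be passed as before:
    # runpod.serverless.start({"handler": traced(input_fn)})
```

If the "start" line never shows up in the logs, the job is not being delivered to the handler at all, which points at the worker startup rather than at the code inside input_fn.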
nerdylive (2mo ago)
I'm not sure what's wrong there, but I'd suggest using another image if this one is problematic.
Search Google for NVIDIA NGC; it's on NVIDIA's domain.
Also, I think Python 3.11 isn't installed on the CUDA 11 image.
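The guess about Python 3.11 can be checked from inside the container instead of assumed. Below is a minimal sketch using only the standard library. `interpreter_report` is a hypothetical helper name, and the default module list (`runpod`, `torch`) is an assumption about what the worker needs.

```python
import importlib.util
import shutil
import sys


def interpreter_report(modules=("runpod", "torch")):
    """Collect facts about the Python interpreter running this script."""
    return {
        "executable": sys.executable,
        "version": "{}.{}.{}".format(*sys.version_info[:3]),
        # Path to a python3.11 binary on PATH, or None if there is none
        "python3.11_on_path": shutil.which("python3.11"),
        # True if the module can be imported by *this* interpreter
        "modules": {m: importlib.util.find_spec(m) is not None for m in modules},
    }


if __name__ == "__main__":
    for key, value in interpreter_report().items():
        print(f"{key}: {value}")
```

Running it with the same interpreter as the Dockerfile's CMD (python3.11 here) shows whether that binary exists and whether the packages were installed for it rather than for a different Python.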
galakurpismo3 (2mo ago)
Okay, I'll try that. I guess using python:3.11.1-buster won't work, right?
nerdylive (2mo ago)
Wait, I thought it works, no? Which template does it work with?
galakurpismo3 (2mo ago)
It works with that one, meaning that it gets inside input_fn, but there are going to be dependencies missing or something to run on the GPU.
Solution
nerdylive (2mo ago)
Hmm yeah, I guess Python 3.11 is missing from that RunPod base image.
nerdylive (2mo ago)
You just have to install them or use templates from NGC.
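Whichever image ends up being used, a quick check can show whether the GPU pieces are actually present in it. This is a rough sketch: `gpu_report` is a hypothetical helper, and it assumes the model uses PyTorch, as the original question suggests.

```python
import shutil


def gpu_report():
    """Report whether NVIDIA tooling and PyTorch's CUDA support are visible."""
    report = {
        "nvidia_smi": shutil.which("nvidia-smi"),  # None if the tool is not on PATH
        "torch": None,
        "torch_cuda_build": None,
        "cuda_available": None,
    }
    try:
        import torch
    except (ImportError, OSError):
        # torch is missing or its shared libraries failed to load
        return report
    report["torch"] = torch.__version__
    report["torch_cuda_build"] = torch.version.cuda  # None for CPU-only builds
    report["cuda_available"] = torch.cuda.is_available()
    return report


if __name__ == "__main__":
    for key, value in gpu_report().items():
        print(f"{key}: {value}")
```

If `torch` is reported but `cuda_available` is False on a GPU worker, the likely suspects are a CPU-only PyTorch build or an image without the CUDA runtime libraries.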
galakurpismo3 (2mo ago)
Yeah, okay, I'll try both things. Thank you so much!
nerdylive (2mo ago)
Np, lmk how it goes.