Serverless worker won't even start but counts as running
Hi, so lately I've been dealing with various issues regarding serverless workers (I am using a custom Dockerfile).
At first I used the official base image with CUDA 11.8 (and an older version of PyTorch), and it worked fine with the 3090 but not with the 5090 (I have two serverless endpoints, one with "lower-end GPUs" and the other with "higher-end GPUs").
So I switched to the base image with the latest version of everything (PyTorch 2.7.1, CUDA 12.9, Ubuntu 24.04), but for some reason now the 3090 pod didn't work and the 5090 pod gave the error
CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected.
I tried to tweak the Docker image a bit, but with no success.
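One thing that can help narrow down a CUDA_ERROR_NO_DEVICE inside a worker is logging what the container actually sees before any CUDA code runs. A minimal sketch (assuming nvidia-smi exists in the base image; the function name is just illustrative):
```python
import os
import subprocess

def log_gpu_visibility():
    # Which GPUs the container runtime says it is exposing.
    for var in ("NVIDIA_VISIBLE_DEVICES", "CUDA_VISIBLE_DEVICES"):
        print(f"{var}={os.environ.get(var)}")

    # nvidia-smi reports the host driver version and the GPUs passed through;
    # if this fails, no CUDA library inside the container will find a device either.
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=name,driver_version",
             "--format=csv,noheader"],
            capture_output=True, text=True, check=True,
        )
        print("nvidia-smi:", result.stdout.strip())
    except (FileNotFoundError, subprocess.CalledProcessError) as exc:
        print("nvidia-smi failed:", exc)

if __name__ == "__main__":
    log_gpu_visibility()
```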
Then I made the Docker image install the nvidia-cuda-toolkit package, and now on the 5090 pod I get the error CUDA error: no kernel image is available for execution on the device.
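For context, "no kernel image is available for execution on the device" usually means the installed PyTorch wheel ships no kernels for that GPU's compute capability (the 5090 is Blackwell, sm_120, which only the newer CUDA 12.8+ builds include). A rough check, assuming PyTorch is importable in the image:
```python
import torch

if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability(0)
    # The device's architecture, e.g. sm_86 for a 3090, sm_120 for a 5090.
    print(f"device: {torch.cuda.get_device_name(0)} (sm_{major}{minor})")
    # The architectures this PyTorch build was actually compiled for.
    print("compiled for:", torch.cuda.get_arch_list())
else:
    print("torch cannot see any CUDA device")
```
If sm_120 (or a matching compute_120 entry) is missing from that list, no amount of tweaking inside the container will help; the wheel itself has to be one built for the newer architecture.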
As of now, I am using runpod/pytorch:0.7.0-ubuntu2404-cu1290-torch271 as the base image (the latest of everything) and nothing even starts: the machines count as running, but the queue remains and I can't even see the logs!
On the 5090 endpoint, I have also set the allowed CUDA versions to 12.8 and 12.9, but it didn't make a difference.
Unknown User•3mo ago
Message Not Public
Yeah, I did all of that.
The worker works fine if I use this base image (for the 3090):
runpod/pytorch:0.7.0-dev-cu1290-torch271-ubuntu2404
but as you can guess it doesn't with the 5090.
I just changed the image to runpod/pytorch:2.8.0-py3.11-cuda12.8.1-cudnn-devel-ubuntu22.04 and tried it on the 5090, and also added some debug logging to my .py file.
And on RunPod, the output was
So apparently it can see CUDA and the 5090, but right after, when the actual script starts, it crashes with the error CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected.
Unknown User•3mo ago
Message Not Public
@Leo
Escalated To Zendesk
The thread has been escalated to Zendesk!
Ticket ID: #21135
5090 ID: 8w5k90ve05nww1 (the last one I just checked)
3090 ID: rkupfkqc4mndu9
The problem seems to come from ffmpeg (with the command ffmpeg -hwaccel cuda -c:v h264_cuvid -ss 15 -i "/videos/sample.mp4" -t 15 -q:v 2 -vf "fps=2" -start_number 31 -y "/output/frames/sample_frame_%05d.jpg", trying to extract frames from a video). I tried in a dedicated pod and, as of now, I get the same error.
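For reference, one way to see whether the ffmpeg build in the container even advertises CUDA decoding, before blaming the driver, is to query it first. A small sketch, assuming ffmpeg is on PATH (the helper name is illustrative):
```python
import subprocess

def ffmpeg_supports_cuda() -> bool:
    """Return True if this ffmpeg build lists the CUDA hwaccel and the h264_cuvid decoder."""
    hwaccels = subprocess.run(
        ["ffmpeg", "-hide_banner", "-hwaccels"],
        capture_output=True, text=True,
    ).stdout
    decoders = subprocess.run(
        ["ffmpeg", "-hide_banner", "-decoders"],
        capture_output=True, text=True,
    ).stdout
    return "cuda" in hwaccels and "h264_cuvid" in decoders

print("ffmpeg CUDA decode available:", ffmpeg_supports_cuda())
```
Even when both show up, hardware decode can still fail at runtime if the machine's driver is older than what that ffmpeg/NVDEC build expects.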
Hey again, ffmpeg is a little fragile and actually depends very specifically on the exact driver version of a given machine.
Yeah, in the end I managed to get it working without GPU acceleration. Not what I wanted, as it is slower, especially with large videos, but it's the only working solution.
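For anyone who finds this later, a minimal sketch of that kind of fallback (the parameters are taken from the ffmpeg command above; the helper name is illustrative): try NVDEC first and drop back to plain CPU decoding if it fails.
```python
import subprocess

def extract_frames(video: str, out_pattern: str) -> None:
    # Same frame-extraction command as above; first with NVDEC, then CPU-only.
    gpu_args = ["-hwaccel", "cuda", "-c:v", "h264_cuvid"]
    for hw_args in (gpu_args, []):
        cmd = ["ffmpeg", *hw_args,
               "-ss", "15", "-i", video, "-t", "15",
               "-q:v", "2", "-vf", "fps=2",
               "-start_number", "31", "-y", out_pattern]
        try:
            subprocess.run(cmd, check=True, capture_output=True)
            return
        except subprocess.CalledProcessError:
            continue  # GPU decode failed; retry without hardware acceleration
    raise RuntimeError("ffmpeg failed with and without GPU acceleration")

extract_frames("/videos/sample.mp4", "/output/frames/sample_frame_%05d.jpg")
```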