Runpod3mo ago
Leo

Serverless worker won't even start but counts as running

Hi, lately I've been dealing with various issues regarding serverless workers (I am using a custom Dockerfile). At first I used the official base image with CUDA 11.8 (and an older version of PyTorch) and it worked fine with the 3090 but not with the 5090 (I have two serverless endpoints, one with "lower end" GPUs and the other with "higher end" GPUs). So I switched to the base image with the latest version of everything (PyTorch 2.7.1, CUDA 12.9, Ubuntu 24.04), but for some reason now the 3090 pod didn't work, and the 5090 pod gave the error CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected. I tried to tweak the Docker image a bit but had no success. Then I made the Docker image install the nvidia-cuda-toolkit package, and now on the 5090 pod I get the error CUDA error: no kernel image is available for execution on the device. As of now, I am using runpod/pytorch:0.7.0-ubuntu2404-cu1290-torch271 as the base image (the latest of everything) and nothing will even start: the machines count as running but the queue remains, and I can't even see the logs! On the 5090 endpoint I have also set the Allowed CUDA Versions to 12.8 and 12.9, but it didn't make a difference.
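A hedged note on the "no kernel image is available" error above: it usually means the PyTorch wheel ships no compiled kernels for the GPU's compute capability (the RTX 5090 is Blackwell, compute capability 12.0, i.e. sm_120). The pure helper below mimics the check; on the pod you would pass it the real values from torch.cuda.get_device_capability(0) and torch.cuda.get_arch_list() instead of the illustrative lists, and the helper name is mine, not a library API.

```python
# Sketch: does a PyTorch build ship kernels for a given GPU?
# Feed it torch.cuda.get_device_capability(0) and torch.cuda.get_arch_list()
# on the actual pod; the lists below are illustrative assumptions.

def device_is_supported(capability, arch_list):
    """True if the build's compiled arch list covers the device."""
    major, minor = capability
    return f"sm_{major}{minor}" in arch_list

# An older wheel without Blackwell kernels:
print(device_is_supported((12, 0), ["sm_80", "sm_86", "sm_90"]))    # False
# A cu128 wheel built with sm_120:
print(device_is_supported((12, 0), ["sm_90", "sm_100", "sm_120"]))  # True
```

If the device's sm_ tag is missing from the arch list, no image tweak will help; only a wheel built for that architecture (e.g. a cu128+ build) will.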
7 Replies
Leo
LeoOP3mo ago
Yeah, I did all of that. The worker works fine if I use runpod/pytorch:0.7.0-dev-cu1290-torch271-ubuntu2404 as the base image (for the 3090), but as you can guess it doesn't with the 5090. I just changed the image to runpod/pytorch:2.8.0-py3.11-cuda12.8.1-cudnn-devel-ubuntu22.04 and tried it on the 5090, and also added some debug logging to my py file:
import torch

# PyTorch version
print("[INFO] PyTorch version:", torch.__version__)

# CUDA version used to build PyTorch
print("PyTorch version:", torch.__version__)
print("Has torch.version:", hasattr(torch, 'version'))
print("torch.version.cuda:", getattr(torch.version, 'cuda', 'Not available'))  # type: ignore

# CUDA availability and current device info
print("[INFO] Is CUDA available:", torch.cuda.is_available())

if torch.cuda.is_available():
    print("[INFO] CUDA device name:", torch.cuda.get_device_name(0))
    print("[INFO] CUDA runtime version (driver):", torch._C._cuda_getCompiledVersion())
and on RunPod, the output was
[INFO] CUDA runtime version (driver): 12080
[INFO] CUDA device name: NVIDIA GeForce RTX 5090
[INFO] Is CUDA available: True
torch.version.cuda: 12.8
Has torch.version: True
PyTorch version: 2.7.1+cu128
[INFO] PyTorch version: 2.7.1+cu128
so apparently it can see cuda and the 5090, but right after, when the actual script starts, it crashes with error CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
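One hedged thing worth ruling out when torch in the handler sees the GPU but a later step reports CUDA_ERROR_NO_DEVICE: the environment handed to that step. An empty (as opposed to unset) CUDA_VISIBLE_DEVICES hides every GPU from the CUDA runtime. The helper name below is mine; it just classifies the variable.

```python
import os

def describe_cuda_visibility(env=os.environ):
    """Classify CUDA_VISIBLE_DEVICES: unset leaves all GPUs visible,
    an empty string hides every GPU (a classic CUDA_ERROR_NO_DEVICE cause)."""
    value = env.get("CUDA_VISIBLE_DEVICES")
    if value is None:
        return "unset (all GPUs visible)"
    if value == "":
        return "empty -> no GPUs visible!"
    return f"set to {value}"

print("CUDA_VISIBLE_DEVICES:", describe_cuda_visibility())
```

Printing this right before the failing step (or from inside the subprocess, if one is spawned) shows whether the CUDA runtime there sees the same environment the working torch checks did.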
Poddy
Poddy3mo ago
@Leo
Escalated To Zendesk
The thread has been escalated to Zendesk!
Ticket ID: #21135
Leo
LeoOP3mo ago
5090 id: 8w5k90ve05nww1 (the last one I just checked); 3090 id: rkupfkqc4mndu9. The problem seems to come from ffmpeg (with the command ffmpeg -hwaccel cuda -c:v h264_cuvid -ss 15 -i "/videos/sample.mp4" -t 15 -q:v 2 -vf "fps=2" -start_number 31 -y "/output/frames/sample_frame_%05d.jpg", trying to extract frames from a video). I tried in a dedicated pod and got the same error.
Dj
Dj3mo ago
Hey again, ffmpeg is a little fragile and actually depends very specifically on the exact driver version of a given machine.
Leo
LeoOP3mo ago
Yeah, in the end I managed to get it working without GPU acceleration. Not what I wanted, as it's slower, especially with large videos, but it's the only working solution.
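For reference, the CPU fallback amounts to dropping the two hardware-decode flags (-hwaccel cuda -c:v h264_cuvid) from the command quoted earlier in the thread. A small sketch, with the paths from the thread; the builder function name is hypothetical:

```python
# Sketch of the ffmpeg invocation from the thread, with the CUDA decode
# flags behind a toggle so the CPU fallback is a one-argument change.
# build_ffmpeg_cmd is a made-up helper name, not a library API.

def build_ffmpeg_cmd(src, dst_pattern, use_cuda=False):
    cmd = ["ffmpeg"]
    if use_cuda:
        cmd += ["-hwaccel", "cuda", "-c:v", "h264_cuvid"]
    cmd += ["-ss", "15", "-i", src, "-t", "15", "-q:v", "2",
            "-vf", "fps=2", "-start_number", "31", "-y", dst_pattern]
    return cmd

# CPU-only fallback, as used in the end:
print(" ".join(build_ffmpeg_cmd("/videos/sample.mp4",
                                "/output/frames/sample_frame_%05d.jpg")))
```

Keeping both variants behind one flag makes it easy to retry GPU decode later on an image whose driver and ffmpeg build actually match.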
