Serverless worker won't even start but counts as running
Hi, so lately I've been dealing with various issues regarding serverless workers (I am using a custom Dockerfile).
At first I used the official base image with CUDA 11.8 (and an older version of PyTorch), and it worked fine with the 3090 but not with the 5090 (I have two serverless endpoints, one with "lower-end GPUs" and the other with "higher-end GPUs").
So I switched to the base image with the latest version of everything (PyTorch 2.7.1, CUDA 12.9, Ubuntu 24.04), but for some reason now the 3090 pod didn't work and the 5090 pod gave the error
CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected.
I tried to tweak the Docker image a bit, but with no success.
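One thing that can help narrow down a CUDA_ERROR_NO_DEVICE inside a worker is logging what the container actually sees before any CUDA code runs. A minimal sketch (assuming nvidia-smi exists in the base image; the function name is just illustrative):
```python
import os
import subprocess

def log_gpu_visibility():
    # Which GPUs the container runtime says it is exposing.
    for var in ("NVIDIA_VISIBLE_DEVICES", "CUDA_VISIBLE_DEVICES"):
        print(f"{var}={os.environ.get(var)}")

    # nvidia-smi reports the host driver version and the GPUs passed through;
    # if this fails, no CUDA library inside the container will find a device either.
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=name,driver_version",
             "--format=csv,noheader"],
            capture_output=True, text=True, check=True,
        )
        print("nvidia-smi:", result.stdout.strip())
    except (FileNotFoundError, subprocess.CalledProcessError) as exc:
        print("nvidia-smi failed:", exc)

if __name__ == "__main__":
    log_gpu_visibility()
```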
Then I made the Docker image install the nvidia-cuda-toolkit package, and now on the 5090 pod I get the error CUDA error: no kernel image is available for execution on the device.
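For context, "no kernel image is available for execution on the device" usually means the installed PyTorch wheel ships no kernels for that GPU's compute capability (the 5090 is Blackwell, sm_120, which only the newer CUDA 12.8+ builds include). A rough check, assuming PyTorch is importable in the image:
```python
import torch

if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability(0)
    # The device's architecture, e.g. sm_86 for a 3090, sm_120 for a 5090.
    print(f"device: {torch.cuda.get_device_name(0)} (sm_{major}{minor})")
    # The architectures this PyTorch build was actually compiled for.
    print("compiled for:", torch.cuda.get_arch_list())
else:
    print("torch cannot see any CUDA device")
```
If sm_120 (or a matching compute_120 entry) is missing from that list, no amount of tweaking inside the container will help; the wheel itself has to be one built for the newer architecture.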
As of now, I am using runpod/pytorch:0.7.0-ubuntu2404-cu1290-torch271 as the base image (the latest of everything) and nothing even starts: the machines count as running, but the queue remains and I can't even see the logs!
On the 5090 endpoint, I have also set the allowed CUDA versions to 12.8 and 12.9, but it didn't make a difference.
Unknown User•3mo ago
Message Not Public
Yeah, I did all of that.
The worker works fine if I use this base image (for the 3090):
runpod/pytorch:0.7.0-dev-cu1290-torch271-ubuntu2404
but as you can guess it doesn't with the 5090.
I just changed the image to runpod/pytorch:2.8.0-py3.11-cuda12.8.1-cudnn-devel-ubuntu22.04 and tried it on the 5090, and also added some debug logging to my .py file.
And on RunPod, the output was
So apparently it can see CUDA and the 5090, but right after, when the actual script starts, it crashes with the error CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected.
Unknown User•3mo ago
Message Not Public
@Leo
Escalated To Zendesk
The thread has been escalated to Zendesk!
Ticket ID: #21135
5090 ID: 8w5k90ve05nww1 (the last one I just checked)
3090 ID: rkupfkqc4mndu9
The problem seems to come from ffmpeg (with the command ffmpeg -hwaccel cuda -c:v h264_cuvid -ss 15 -i "/videos/sample.mp4" -t 15 -q:v 2 -vf "fps=2" -start_number 31 -y "/output/frames/sample_frame_%05d.jpg", trying to extract frames from a video). I tried in a dedicated pod and, as of now, I get the same error.
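For reference, one way to see whether the ffmpeg build in the container even advertises CUDA decoding, before blaming the driver, is to query it first. A small sketch, assuming ffmpeg is on PATH (the helper name is illustrative):
```python
import subprocess

def ffmpeg_supports_cuda() -> bool:
    """Return True if this ffmpeg build lists the CUDA hwaccel and the h264_cuvid decoder."""
    hwaccels = subprocess.run(
        ["ffmpeg", "-hide_banner", "-hwaccels"],
        capture_output=True, text=True,
    ).stdout
    decoders = subprocess.run(
        ["ffmpeg", "-hide_banner", "-decoders"],
        capture_output=True, text=True,
    ).stdout
    return "cuda" in hwaccels and "h264_cuvid" in decoders

print("ffmpeg CUDA decode available:", ffmpeg_supports_cuda())
```
Even when both show up, hardware decode can still fail at runtime if the machine's driver is older than what that ffmpeg/NVDEC build expects.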
Hey again, ffmpeg is a little fragile and actually depends very specifically on the exact driver version of a given machine.
Yeah, in the end I managed to get it working without GPU acceleration. Not what I wanted, as it is slower, especially with large videos, but it's the only working solution.
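For anyone who finds this later, a minimal sketch of that kind of fallback (the parameters are taken from the ffmpeg command above; the helper name is illustrative): try NVDEC first and drop back to plain CPU decoding if it fails.
```python
import subprocess

def extract_frames(video: str, out_pattern: str) -> None:
    # Same frame-extraction command as above; first with NVDEC, then CPU-only.
    gpu_args = ["-hwaccel", "cuda", "-c:v", "h264_cuvid"]
    for hw_args in (gpu_args, []):
        cmd = ["ffmpeg", *hw_args,
               "-ss", "15", "-i", video, "-t", "15",
               "-q:v", "2", "-vf", "fps=2",
               "-start_number", "31", "-y", out_pattern]
        try:
            subprocess.run(cmd, check=True, capture_output=True)
            return
        except subprocess.CalledProcessError:
            continue  # GPU decode failed; retry without hardware acceleration
    raise RuntimeError("ffmpeg failed with and without GPU acceleration")

extract_frames("/videos/sample.mp4", "/output/frames/sample_frame_%05d.jpg")
```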