Serverless worker won't even start but counts as running
Hi, so lately I've been dealing with various issues with my serverless workers (I'm using a custom Dockerfile).
At first I used the official base image with CUDA 11.8 (and an older version of PyTorch), and it worked fine on the 3090 but not on the 5090 (I have two serverless endpoints, one with "lower end" GPUs and the other with "higher end" GPUs). So I switched to the base image with the latest version of everything (PyTorch 2.7.1, CUDA 12.9, Ubuntu 24.04), but now the 3090 pod doesn't work and the 5090 pod gives the error CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected. I tried to tweak the Docker image a bit, but with no success.
Then I made the Docker image install the nvidia-cuda-toolkit package, and now on the 5090 pod I get the error CUDA error: no kernel image is available for execution on the device.
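If it helps, this is roughly the sanity check I can run inside the container to see what the worker actually detects (just the standard torch calls, nothing specific to my own code):

# cuda_check.py - quick look at what the container sees at start-up
import torch

print("torch:", torch.__version__)
print("torch built for CUDA:", torch.version.cuda)
print("cuda available:", torch.cuda.is_available())

if torch.cuda.is_available():
    name = torch.cuda.get_device_name(0)
    major, minor = torch.cuda.get_device_capability(0)
    print(f"device: {name}, compute capability sm_{major}{minor}")
    # From what I understand the 5090 reports sm_120, so the PyTorch
    # build has to ship sm_120 kernels, otherwise "no kernel image
    # is available" shows up as soon as anything touches the GPU.
    print("archs in this torch build:", torch.cuda.get_arch_list())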
As of now, I am using runpod/pytorch:0.7.0-ubuntu2404-cu1290-torch271 as the base image (the latest of everything) and nothing will even start: the machines count as running, but the jobs just stay in the queue and I can't even see the logs! On the 5090 endpoint I have also set the allowed CUDA versions to 12.8 and 12.9, but it didn't make a difference.
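In case the handler itself matters: it's just the standard runpod serverless pattern, roughly like this simplified sketch (the real model code is stripped out and the names are placeholders; the import-time print is only there so something shows up in the logs even when jobs just sit in the queue):

# rp_handler.py - simplified sketch of the worker entrypoint
import runpod
import torch

# Printed at import time so the worker logs show *something*,
# even if no job ever gets picked up from the queue.
print("worker starting, cuda available:", torch.cuda.is_available())

def handler(job):
    # job["input"] holds the request payload sent to the endpoint
    prompt = job["input"].get("prompt", "")
    device = torch.cuda.get_device_name(0) if torch.cuda.is_available() else "cpu"
    # ... real inference would go here ...
    return {"echo": prompt, "device": device}

runpod.serverless.start({"handler": handler})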