I've got an endpoint that filters to only use GPUs with CUDA 13.0, and a worker image built on the cu130/torch-2.10 PyTorch wheel. The workers mostly run fine, but around 10% of jobs are failing because the worker can't initialize CUDA. What can I do to avoid these bad workers? (I've sketched a startup self-check workaround below the log, but I'd rather not need it.) It would also be great if you could look into this on your side - it looks like some kind of driver mismatch on your GPUs.
Example bad worker ID: 7ilvjrl2rf1c9e
Error log (I've attached the rest):
2026-02-19 01:55:26.878 | info | 7ilvjrl2rf1c9e | RuntimeError: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero.
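For now, the workaround I'm considering is a CUDA self-check at container startup so a bad worker exits before it pulls any jobs (and hopefully gets replaced) instead of failing the job itself. This is just a rough sketch assuming the entrypoint is Python and torch is already in the image; the handler-loop part is placeholder:

```python
import sys
import torch

def cuda_self_check() -> None:
    """Fail fast if this worker cannot initialize CUDA."""
    try:
        if not torch.cuda.is_available():
            raise RuntimeError("torch.cuda.is_available() returned False")
        torch.cuda.init()  # force CUDA context creation up front
        dev = torch.cuda.current_device()
        name = torch.cuda.get_device_name(dev)
        # Tiny allocation + kernel to confirm the device actually works.
        x = torch.ones(8, device=f"cuda:{dev}")
        assert x.sum().item() == 8.0
        print(f"CUDA self-check passed on {name}")
    except Exception as exc:
        print(f"CUDA self-check failed: {exc}", file=sys.stderr)
        sys.exit(1)  # exit before accepting work so no job is lost

if __name__ == "__main__":
    cuda_self_check()
    # ... start the normal handler loop here ...
```

But that only hides the problem - the worker still gets scheduled, fails, and presumably burns time before another one spins up, so I'd prefer the underlying driver/host issue to be fixed.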