I've got an endpoint that filters to only use GPUs with CUDA 13.0, and a worker image built on the cu130/torch-2.10 PyTorch wheel. The workers mostly run fine, but around 10% of jobs are failing because the worker can't initialize CUDA. What can I do to avoid these bad workers? (I've sketched a startup self-check workaround below the log, but I'd rather not need it.) It would also be great if you could look into this on your side - it looks like some kind of driver mismatch on your GPUs.
Example bad worker ID: 7ilvjrl2rf1c9e
Error log (I've attached the rest):
2026-02-19 01:55:26.878 | info | 7ilvjrl2rf1c9e | RuntimeError: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero.
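For now, the workaround I'm considering is a CUDA self-check at container startup so a bad worker exits before it pulls any jobs (and hopefully gets replaced) instead of failing the job itself. This is just a rough sketch assuming the entrypoint is Python and torch is already in the image; the handler-loop part is placeholder:

```python
import sys
import torch

def cuda_self_check() -> None:
    """Fail fast if this worker cannot initialize CUDA."""
    try:
        if not torch.cuda.is_available():
            raise RuntimeError("torch.cuda.is_available() returned False")
        torch.cuda.init()  # force CUDA context creation up front
        dev = torch.cuda.current_device()
        name = torch.cuda.get_device_name(dev)
        # Tiny allocation + kernel to confirm the device actually works.
        x = torch.ones(8, device=f"cuda:{dev}")
        assert x.sum().item() == 8.0
        print(f"CUDA self-check passed on {name}")
    except Exception as exc:
        print(f"CUDA self-check failed: {exc}", file=sys.stderr)
        sys.exit(1)  # exit before accepting work so no job is lost

if __name__ == "__main__":
    cuda_self_check()
    # ... start the normal handler loop here ...
```

But that only hides the problem - the worker still gets scheduled, fails, and presumably burns time before another one spins up, so I'd prefer the underlying driver/host issue to be fixed.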