CUDA error: CUDA-capable device(s) is/are busy or unavailable
I see quite a few jobs fail with this error message:
RuntimeError: CUDA error: CUDA-capable device(s) is/are busy or unavailable
18:38:08 CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
When this happens, it usually affects every job on that worker, so I have to terminate the worker. A retry on another worker completes as expected.
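One possible workaround, sketched here under the assumption that the failing jobs use PyTorch: probe the GPU once before the worker accepts any jobs, and bail out if the probe fails, so a bad worker does not keep picking up jobs it cannot run. The names `default_probe` and `gpu_is_usable` are hypothetical helpers for illustration, not part of any Runpod or PyTorch API.

```python
import sys


def default_probe():
    """Touch the GPU; raises RuntimeError if the device is busy or unavailable."""
    import torch  # assumption: the jobs on this worker use PyTorch

    torch.zeros(1, device="cuda")  # tiny allocation on the default CUDA device
    torch.cuda.synchronize()       # force asynchronous CUDA errors to surface here


def gpu_is_usable(probe=default_probe):
    """Return True if the probe runs cleanly, False if it raises RuntimeError."""
    try:
        probe()
        return True
    except RuntimeError as exc:
        # Log the reason so the failure is visible in the worker logs.
        print(f"GPU probe failed: {exc}", file=sys.stderr)
        return False
```

Calling `gpu_is_usable()` at worker start-up and exiting with a non-zero status when it returns `False` might avoid having to terminate the worker by hand, though whether the platform then replaces the worker automatically is an assumption I have not verified.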