CUDA error: CUDA-capable device(s) is/are busy or unavailable
I see quite a few jobs fail with this error message:
RuntimeError: CUDA error: CUDA-capable device(s) is/are busy or unavailable
18:38:08 CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
Once it appears, it usually affects every job on that worker, so I have to terminate the worker.
A retry on another worker completes as expected.
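Since the failure seems tied to the worker rather than the job, one option might be a preflight check that probes the GPU before a job starts. The following is only a rough sketch, assuming the jobs use PyTorch (the `RuntimeError` wording suggests so, but that is a guess); `default_probe` and `cuda_device_usable` are hypothetical helper names, not part of any library.

```python
def default_probe():
    # Assumption: PyTorch is installed on the worker. It is imported lazily
    # so this module can still be loaded on machines without it.
    import torch

    # Allocating a tiny tensor forces CUDA context creation, which is where
    # a "busy or unavailable" device would raise RuntimeError.
    torch.zeros(1, device="cuda")
    # Wait for pending GPU work so asynchronous errors surface here.
    torch.cuda.synchronize()


def cuda_device_usable(probe=default_probe):
    """Return True if the probe runs without raising, False otherwise."""
    try:
        probe()
    except Exception:
        # RuntimeError from CUDA, or ImportError if torch is missing.
        return False
    return True
```

A worker could call `cuda_device_usable()` at startup, or before each job, and exit with a non-zero status when it returns `False`, so that the scheduler replaces the worker instead of letting every job on it fail.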