Created by Anton on 4/8/2025 in #⚡|serverless
Serverless endpoint fails with CUDA error
Hi! I have a WhisperX image built from this repo: https://github.com/kodxana/whisperx-worker. One of the RunPod workers for that image started returning a "CUDA failed: CUDA-capable device is busy or unavailable" error for every request. Once that worker was restarted, everything worked again.
The problem is that it failed 18 requests before we noticed the error and fixed it. This is the first time this error has occurred.
Is there a way to set up notifications when a worker fails many requests, or to automatically restart a worker after a few failed requests?
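On the notification side, I was thinking I could poll the endpoint's health stats myself and alert when the failed-job count climbs. A minimal sketch, assuming the serverless health route at https://api.runpod.ai/v2/{endpoint_id}/health still returns a jobs.failed counter; the env var names, threshold, and print-based alert are all placeholders:

```python
import os
import time

import requests

# Placeholders I'd set in the monitoring environment.
ENDPOINT_ID = os.environ["RUNPOD_ENDPOINT_ID"]
API_KEY = os.environ["RUNPOD_API_KEY"]

HEALTH_URL = f"https://api.runpod.ai/v2/{ENDPOINT_ID}/health"


def check_failures(threshold: int = 5) -> None:
    """Alert if the endpoint reports too many failed jobs."""
    resp = requests.get(
        HEALTH_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        timeout=10,
    )
    resp.raise_for_status()
    stats = resp.json()
    failed = stats.get("jobs", {}).get("failed", 0)
    if failed >= threshold:
        # Swap this for a real notifier (Slack webhook, email, PagerDuty, ...).
        print(f"ALERT: {failed} failed jobs on endpoint {ENDPOINT_ID}")


if __name__ == "__main__":
    while True:
        check_failures()
        time.sleep(60)
```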
And what could cause this error?
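In the meantime, to limit the blast radius while I figure out the cause, would something like this in the handler make sense? A rough sketch, assuming the standard runpod Python SDK handler pattern; run_whisperx is a stand-in for the repo's actual transcription call, and I'm going off the refresh_worker flag I've seen mentioned in the RunPod docs rather than anything I've tested:

```python
import runpod


def run_whisperx(job_input):
    """Placeholder for the repo's actual transcription call."""
    raise NotImplementedError


def handler(job):
    try:
        return run_whisperx(job["input"])
    except RuntimeError as exc:
        if "CUDA" in str(exc):
            # Ask RunPod to recycle this worker after the job returns, so it
            # doesn't keep failing every subsequent request on a bad GPU.
            return {"error": str(exc), "refresh_worker": True}
        raise


runpod.serverless.start({"handler": handler})
```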