RunPod•4w ago
Anton

Serverless endpoint fails with CUDA error

Hi! I have a WhisperX image built from this repo: https://github.com/kodxana/whisperx-worker. One of the RunPod workers for that image started failing every request with the error "CUDA failed with error CUDA-capable device(s) is/are busy or unavailable". Restarting the worker fixed everything. The problem is that it failed 18 requests before we noticed and fixed it. This is the first time this error has happened. Is there a way to set up notifications when a worker fails a lot of requests, or to restart the worker automatically after a few failures? And what could cause this error?
Solution:
Edit your endpoint, then set the minimum CUDA version to 12.4.
5 Replies
Jason•4w ago
Try editing your endpoint.
Solution
Jason•4w ago
Then set the minimum CUDA version to 12.4.
Jason•4w ago
Scroll down in the endpoint settings to see this option.
AntonOP•4w ago
Thanks! I set it up. So, as far as I can tell, once a few new CUDA versions are released, I'll have to check whether any GPUs are unavailable for the selected versions and, if so, remove the unhealthy version? And what about notifications for such cases? For some reason, the worker wasn't marked unhealthy; it only started throttling after 1 hour of returning errors, so I'd like to set up a notification when there are a lot of failures and, if possible, without writing a custom microservice for that 😅
Jason•4w ago
Just add proper error handling in your worker's container so that you can send a notification if you want to. Also, as long as you don't update your container (that is, you don't push a new image or change the image tag), the apps inside usually stay at the same version.
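As a rough illustration of the "proper handling" idea, here is a minimal sketch of a wrapper that counts consecutive job failures inside the worker and sends one alert when a threshold is reached. The webhook URL, the threshold, and the names `send_webhook` and `with_failure_alerts` are all hypothetical and not part of the whisperx-worker repo or the RunPod SDK; only the Python standard library is used.

```python
import json
import urllib.request

# Hypothetical values: replace with your own alert webhook and threshold.
WEBHOOK_URL = "https://example.com/hooks/alerts"
FAILURE_THRESHOLD = 3


def send_webhook(message, url=WEBHOOK_URL):
    """POST a small JSON alert to a webhook endpoint (payload shape is an assumption)."""
    body = json.dumps({"text": message}).encode("utf-8")
    req = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    urllib.request.urlopen(req, timeout=5)


def with_failure_alerts(handler, notify=send_webhook, threshold=FAILURE_THRESHOLD):
    """Wrap a job handler so that `threshold` failures in a row trigger one alert."""
    state = {"consecutive_failures": 0}

    def wrapped(job):
        try:
            result = handler(job)
        except Exception as exc:
            state["consecutive_failures"] += 1
            # Alert exactly once, when the streak first reaches the threshold.
            if state["consecutive_failures"] == threshold:
                try:
                    notify(f"Worker failed {threshold} jobs in a row: {exc}")
                except Exception:
                    pass  # alerting problems must not hide the original error
            raise  # re-raise so the job is still reported as failed
        # Any successful job resets the streak.
        state["consecutive_failures"] = 0
        return result

    return wrapped
```

Assuming the worker registers its handler with `runpod.serverless.start({"handler": handler})`, the wrapped version would be passed instead, as in `{"handler": with_failure_alerts(handler)}`. The counter lives in the worker process, so it only tracks failures on that one worker and resets when the worker restarts.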
