Anton
RunPod
Created by Anton on 4/8/2025 in #⚡|serverless
Serverless endpoint fails with CUDA error
Thanks! I set it up. So, as far as I can tell, once a few new CUDA versions are released I'll have to double-check whether any GPUs are unavailable for the selected versions and remove the unhealthy version if so? And what about notifications for such cases? For some reason the worker wasn't marked unhealthy; it only started throttling after an hour of returning errors, so I'd like to set up a notification when there are a lot of failures and, if possible, without writing a custom microservice for that 😅
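One lightweight option, short of a full microservice, is a small script run from cron that polls the endpoint's health stats and fires an alert when the failure ratio spikes. This is only a sketch: the `/v2/{endpoint_id}/health` route and the `jobs.failed` / `jobs.completed` field names are assumptions based on the public serverless API and should be checked against the current docs, and `ENDPOINT_ID`, `API_KEY`, and the threshold are placeholders.

```python
# Hedged sketch: poll the serverless health stats and alert on a high
# failure rate. The /health route and response fields are assumptions;
# verify them against the current RunPod API docs before relying on this.
import json
import urllib.request

ENDPOINT_ID = "your-endpoint-id"   # placeholder
API_KEY = "your-api-key"           # placeholder
FAILURE_THRESHOLD = 0.2            # alert if >20% of recent jobs failed


def should_alert(health: dict, threshold: float = FAILURE_THRESHOLD) -> bool:
    """Return True when the failed-job ratio exceeds the threshold."""
    jobs = health.get("jobs", {})
    failed = jobs.get("failed", 0)
    completed = jobs.get("completed", 0)
    total = failed + completed
    if total == 0:
        return False
    return failed / total > threshold


def fetch_health() -> dict:
    """Fetch health stats for the endpoint (assumed route, see lead-in)."""
    req = urllib.request.Request(
        f"https://api.runpod.ai/v2/{ENDPOINT_ID}/health",
        headers={"Authorization": f"Bearer {API_KEY}"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


if __name__ == "__main__":
    health = fetch_health()
    if should_alert(health):
        # Swap this print for any webhook call (Slack, Discord, email, ...)
        print("ALERT: high failure rate on endpoint", ENDPOINT_ID, health)
```

Scheduling this every few minutes via cron would have caught the hour of silent errors well before the worker started throttling.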
9 replies