Intermittent "CUDA error: device-side assert triggered" on Runpod Serverless GPU Worker

I’m deploying a GPU-based service on Runpod Serverless. Most requests run fine, but after some time I start getting errors like:

"error": "CUDA error: device-side assert triggered CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1 Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.", "executionTime": 120575
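One debugging approach, sketched below under stated assumptions: a device-side assert is a sticky CUDA error, so once one request triggers it, the CUDA context in that worker process is typically unusable and later requests on the same worker can keep failing until the process restarts. That would fit the pattern of errors appearing "after some time". The error text itself suggests setting CUDA_LAUNCH_BLOCKING=1, which makes kernel launches synchronous so the stack trace points at the failing call. A frequent trigger is an out-of-range index (a class label or token ID) passed to an indexing, embedding, or loss kernel. The helpers `find_out_of_range` and `validate_request` are hypothetical, not part of Runpod or PyTorch; they only show the idea of rejecting bad input on the CPU before it reaches the GPU.

```python
import os

# CUDA_LAUNCH_BLOCKING only takes effect if it is set before CUDA is
# initialized; the safest place is before `import torch` in the worker.
# setdefault keeps any value already provided by the environment.
os.environ.setdefault("CUDA_LAUNCH_BLOCKING", "1")


def find_out_of_range(indices, num_classes):
    """Return (position, value) pairs for indices outside [0, num_classes).

    Out-of-range labels or token IDs are a common cause of device-side
    asserts in indexing, embedding, and loss kernels.
    """
    return [(i, v) for i, v in enumerate(indices) if not 0 <= v < num_classes]


def validate_request(token_ids, vocab_size):
    """Hypothetical pre-flight check for a worker handler.

    Raises an ordinary Python ValueError on the CPU instead of letting a
    GPU kernel assert and leave the CUDA context in an error state.
    """
    bad = find_out_of_range(token_ids, vocab_size)
    if bad:
        raise ValueError(f"{len(bad)} out-of-range id(s), first: {bad[0]}")
    return token_ids
```

For example, `validate_request([1, 5, 9], vocab_size=10)` returns the list unchanged, while `validate_request([1, 12], vocab_size=10)` raises `ValueError` before any GPU work is done. CUDA_LAUNCH_BLOCKING slows execution, so it is best enabled only while reproducing the failure.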