RunPod 4mo ago
ashleyk

Broken serverless worker - can't find GPU

Serverless worker ptrh2jn7wjkcmd on endpoint qbw30nmknd6cmh is broken; it can't find the GPU.
{
"dt":"2024-02-19 23:34:37.252459"
"endpointid":"qbw30nmknd6cmh"
"level":"error"
"message":"An exception was raised: [ONNXRuntimeError] : 1 : FAIL : Non-zero status code returned while running FusedConv node. Name:'Conv_24' Status Message: CUDNN failure 4: CUDNN_STATUS_INTERNAL_ERROR ; GPU=0 ; hostname=acb6f843d220 ; file=/onnxruntime_src/onnxruntime/core/providers/cuda/nn/conv.cc ; line=382 ; expr=cudnnFindConvolutionForwardAlgorithmEx( GetCudnnHandle(context), s_.x_tensor, s_.x_data, s_.w_desc, s_.w_data, s_.conv_desc, s_.y_tensor, s_.y_data, 1, &algo_count, &perf, algo_search_workspace.get(), max_ws_size); "
"workerId":"ptrh2jn7wjkcmd"
}
{
"dt":"2024-02-19 23:34:37.252459"
"endpointid":"qbw30nmknd6cmh"
"level":"error"
"message":"An exception was raised: [ONNXRuntimeError] : 1 : FAIL : Non-zero status code returned while running FusedConv node. Name:'Conv_24' Status Message: CUDNN failure 4: CUDNN_STATUS_INTERNAL_ERROR ; GPU=0 ; hostname=acb6f843d220 ; file=/onnxruntime_src/onnxruntime/core/providers/cuda/nn/conv.cc ; line=382 ; expr=cudnnFindConvolutionForwardAlgorithmEx( GetCudnnHandle(context), s_.x_tensor, s_.x_data, s_.w_desc, s_.w_data, s_.conv_desc, s_.y_tensor, s_.y_data, 1, &algo_count, &perf, algo_search_workspace.get(), max_ws_size); "
"workerId":"ptrh2jn7wjkcmd"
}
1 Reply
ashleyk 4mo ago
It might also be worth mentioning that this is the first time I've seen this error in almost 12,000 requests to the endpoint.
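
Since the failure is this rare, one way to tell a worker-level GPU fault like the one in the log apart from an ordinary request error is to match on the cuDNN status in the exception message. The following is only a minimal sketch: `classify_error` and `run_with_gpu_check` are hypothetical helper names, `infer` stands in for whatever function runs the ONNX model, and the assumption that a `CUDNN failure` message means the worker itself is unhealthy has not been confirmed for this endpoint.

```python
import re
from typing import Any, Callable, Optional

# Matches the "CUDNN failure <n>: <STATUS>" fragment seen in the log above.
GPU_FAULT_PATTERN = re.compile(r"CUDNN failure \d+: (CUDNN_STATUS_\w+)")


def classify_error(message: str) -> Optional[str]:
    """Return the cuDNN status name found in an error message, or None."""
    match = GPU_FAULT_PATTERN.search(message)
    return match.group(1) if match else None


def run_with_gpu_check(job: Any, infer: Callable[[Any], Any]) -> dict:
    """Hypothetical wrapper: run infer(job) and report GPU faults separately.

    Assumption: a cuDNN failure points at the worker, not at the request,
    so it is surfaced under its own "gpu_fault" key instead of re-raised.
    """
    try:
        return {"output": infer(job)}
    except Exception as exc:
        status = classify_error(str(exc))
        if status is not None:
            return {"error": str(exc), "gpu_fault": status}
        raise  # Anything else is treated as a normal request error.
```

A caller could use the `gpu_fault` key to decide whether to retry the request on a different worker, though how a worker gets taken out of rotation depends on the platform and is not covered here.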