Serverless instances are not assigned GPUs, resulting in job errors in production. Assistance required.
Error Message 1 with Stack Trace:
Task Failed [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Exception during initialization: /onnxruntime_src/onnxruntime/core/providers/cuda/cuda_call.cc:121 std::conditional_t<THRW, void, onnxruntime::common::Status> onnxruntime::CudaCall(ERRTYPE, const char, const char, ERRTYPE, const char, const char, int) [with ERRTYPE = cudnnStatus_t; bool THRW = true; std::conditional_t<THRW, void, onnxruntime::common::Status> = void] /onnxruntime_src/onnxruntime/core/providers/cuda/cuda_call.cc:114 std::conditional_t<THRW, void, onnxruntime::common::Status> onnxruntime::CudaCall(ERRTYPE, const char, const char, ERRTYPE, const char, const char, int) [with ERRTYPE = cudnnStatus_t; bool THRW = true; std::conditional_t<THRW, void, onnxruntime::common::Status> = void] CUDNN failure 4: CUDNN_STATUS_INTERNAL_ERROR ; GPU=0 ; hostname=0220236a79a1 ; file=/onnxruntime_src/onnxruntime/core/providers/cuda/cuda_execution_provider.cc ; line=177 ; expr=cudnnCreate(&cudnnhandle); \n
Error Message 2:
Failed to get job. | Error Type: ClientOSError | Error Message: [Errno 104] Connection reset by peer
Will refreshing the worker help in this situation?
5 Replies
Got it, thanks. But Error Message 1 is a cuDNN error: it indicates that cuDNN couldn't initialize properly (the failing call is cudnnCreate), which may be due to a driver issue, a memory allocation issue, or an internal cuDNN bug.
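To tell this kind of GPU initialization failure apart from the network error in Error Message 2, the fields in the message can be pulled out programmatically. The following is a minimal sketch, not part of ONNX Runtime: `parse_cudnn_failure` is a hypothetical helper, and it assumes the message keeps the "CUDNN failure <code>: <status> ; key=value ; ..." layout seen in the trace above.

```python
import re

# Matches the "CUDNN failure <code>: <status>" fragment of the error text.
_FAILURE_RE = re.compile(r"CUDNN failure (\d+): (\w+)")

# Shortened copy of the tail of Error Message 1, used as sample input.
SAMPLE = (
    "CUDNN failure 4: CUDNN_STATUS_INTERNAL_ERROR ; GPU=0 ; "
    "hostname=0220236a79a1 ; "
    "file=/onnxruntime_src/onnxruntime/core/providers/cuda/cuda_execution_provider.cc ; "
    "line=177 ; expr=cudnnCreate(&cudnnhandle);"
)


def parse_cudnn_failure(message):
    """Return a dict of fields from a cuDNN failure message, or None if absent.

    Hypothetical helper: assumes fields after the status are separated
    by " ; " and written as key=value, as in the trace above.
    """
    match = _FAILURE_RE.search(message)
    if match is None:
        return None
    info = {"code": int(match.group(1)), "status": match.group(2)}
    # Everything after the status is a list of " ; "-separated key=value pairs.
    for part in message[match.end():].split(" ; "):
        key, sep, value = part.partition("=")
        if sep:  # skip fragments that contain no "="
            info[key.strip()] = value.strip()
    return info


if __name__ == "__main__":
    print(parse_cudnn_failure(SAMPLE))
    print(parse_cudnn_failure("[Errno 104] Connection reset by peer"))
```

A result with `expr` set to `cudnnCreate(&cudnnhandle);` means the failure happened while creating the cuDNN handle, before any inference ran; a result of `None` means the message is not a cuDNN failure at all.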
If this hasn't cleared up, can you share your worker or endpoint ID and I'll take a look at it?
Thanks for the response. For now, I have refreshed the workers that were giving these errors. I will ping here if the error comes back.
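For anyone hitting the same thing, a startup check can flag a worker with no visible GPU before it picks up jobs. This is a rough sketch under assumptions: `list_visible_gpus` is a hypothetical helper, it relies on `nvidia-smi -L` being present in the container image, and it only confirms that the driver can see a GPU. It does not prove that cuDNN will initialize.

```python
import subprocess


def list_visible_gpus(run=subprocess.run):
    """Return the GPU lines printed by `nvidia-smi -L`, or [] if none are visible.

    `run` is injectable so the check can be exercised without a GPU.
    """
    try:
        result = run(
            ["nvidia-smi", "-L"],
            capture_output=True,
            text=True,
            timeout=10,
        )
    except (OSError, subprocess.TimeoutExpired):
        # nvidia-smi is missing, not executable, or hung: treat as "no GPU".
        return []
    if result.returncode != 0:
        return []
    # Each visible device is listed on its own line starting with "GPU <index>:".
    return [line for line in result.stdout.splitlines() if line.startswith("GPU ")]


if __name__ == "__main__":
    gpus = list_visible_gpus()
    if not gpus:
        raise SystemExit("No GPU visible to this worker")
    print("\n".join(gpus))
```

Run at worker startup, a non-zero exit makes the missing GPU obvious in the logs instead of surfacing later as a cuDNN initialization error in the middle of a job.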