41see
RunPod
Created by 41see on 4/11/2025 in #⛅|pods-clusters
CUDA device uncorrectable ECC error
I'm using a 5xH100 pod and got uncorrectable ECC errors on devices 1, 2, and 3. Devices 0 and 4 work without a problem. It seems the devices or the system need a reboot. Any help with this? I've already submitted a ticket on the website with the pod ID.

Python 3.12.5 | packaged by Anaconda, Inc. | (main, Sep 12 2024, 18:27:27) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.device_count()
5
>>> torch.tensor([1], device='cuda:0')
tensor([1], device='cuda:0')
>>> torch.tensor([1], device='cuda:1')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: CUDA error: uncorrectable ECC error encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
>>> torch.tensor([1], device='cuda:2')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: CUDA error: uncorrectable ECC error encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
>>> torch.tensor([1], device='cuda:3')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: CUDA error: uncorrectable ECC error encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
>>> torch.tensor([1], device='cuda:4')
tensor([1], device='cuda:4')
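For anyone who wants to repeat the per-device check above without typing each line by hand, here is a minimal sketch. The helper names (`probe_devices`, `healthy_devices`, `torch_probe`, `probe_all_cuda_devices`) are made up for illustration and are not part of PyTorch or RunPod; the only PyTorch calls assumed are `torch.cuda.device_count()`, `torch.tensor(..., device=...)`, and `torch.cuda.synchronize()`.

```python
from typing import Callable, Dict, List, Optional


def probe_devices(device_count: int,
                  probe: Callable[[int], None]) -> Dict[int, Optional[str]]:
    """Call probe(index) for each device index.

    Returns a dict mapping index -> None if the probe succeeded,
    or the first line of the RuntimeError message if it failed.
    """
    results: Dict[int, Optional[str]] = {}
    for index in range(device_count):
        try:
            probe(index)
            results[index] = None
        except RuntimeError as exc:
            # Keep only the first line; PyTorch appends debugging hints after it.
            results[index] = str(exc).split("\n", 1)[0]
    return results


def healthy_devices(results: Dict[int, Optional[str]]) -> List[int]:
    """Return the sorted indices whose probe raised no error."""
    return sorted(index for index, error in results.items() if error is None)


def torch_probe(index: int) -> None:
    """Hypothetical probe: allocate a tiny tensor on cuda:<index>, as in the post."""
    import torch  # imported lazily so the helpers above work without PyTorch

    torch.tensor([1], device=f"cuda:{index}")
    # CUDA errors can be reported asynchronously, so wait for the device.
    torch.cuda.synchronize(index)


def probe_all_cuda_devices() -> Dict[int, Optional[str]]:
    """Probe every CUDA device that PyTorch can see in this process."""
    import torch

    return probe_devices(torch.cuda.device_count(), torch_probe)
```

On the pod, `healthy_devices(probe_all_cuda_devices())` should list the indices that still accept allocations. As a stopgap while waiting on the ticket, those indices could be passed through the `CUDA_VISIBLE_DEVICES` environment variable when launching a job, so that only the working GPUs are used.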
84 replies