MI300X HIP error: no ROCm-capable device is detected

I'm using an MI300X with the RunPod PyTorch 2.4.0 ROCm 6.1 template and getting `RuntimeError: HIP error: no ROCm-capable device is detected`. How can I resolve this?
1 Reply
Snektron (2w ago)
I'm having the same issue, but it only started recently (about 30 minutes ago). It seems to have been triggered by a GPU hang, and restarting the pod doesn't fix it. When this happens on our own nodes, it's usually fixed automatically.
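
For anyone trying to narrow this down, here is a minimal diagnostic sketch. It is not from the thread or from RunPod. It assumes a Linux container where ROCm exposes the GPU through the `/dev/kfd` and `/dev/dri/renderD*` device nodes, and the helper names `check_rocm_devices` and `check_torch` are hypothetical. If the device nodes are missing or inaccessible inside the pod, the problem is probably on the host or container side rather than in PyTorch, which would be consistent with a restart not helping.

```python
import glob
import os


def check_rocm_devices(dev_root="/dev"):
    """Report whether the device nodes ROCm relies on are visible here.

    dev_root is a parameter only so the check can be pointed at a
    different directory for testing; on a real pod it stays "/dev".
    """
    kfd = os.path.join(dev_root, "kfd")
    render_nodes = sorted(glob.glob(os.path.join(dev_root, "dri", "renderD*")))
    return {
        "kfd_present": os.path.exists(kfd),
        # ROCm needs read/write access to the kfd node.
        "kfd_accessible": os.access(kfd, os.R_OK | os.W_OK),
        "render_nodes": render_nodes,
    }


def check_torch():
    """Report what PyTorch sees, or None if PyTorch is not installed."""
    try:
        import torch
    except ImportError:
        return None
    return {
        # torch.version.hip is None on builds without ROCm support.
        "hip_version": torch.version.hip,
        "available": torch.cuda.is_available(),
        "device_count": torch.cuda.device_count(),
    }


if __name__ == "__main__":
    print("device nodes:", check_rocm_devices())
    print("pytorch:", check_torch())
```

If `kfd_present` is false or `render_nodes` is empty, the container cannot see the GPU at all. If both look fine but PyTorch still reports no device, the output is worth including in a support request.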