Faulty node?

Since this morning, I've encountered this error multiple times: 'CUDA error: uncorrectable ECC error encountered'. Every time, after terminating the pod and starting a new one, the problem went away. All incidents were on US-GA-2, H100-PCIe.
10 Replies
yhlong00000, 11mo ago
Hey, could you share the pod IDs here? We can take a look.
BlackWhiteAsian (OP), 11mo ago
Didn't note them down, but I just encountered one again: 3sc3qsn1qhu0mz
Another one. Two in a row: 3g93y1byjkjq1o
Can I get a refund for the time I wasted on these? I've had more than 10 of these in the past 2 days.
Another one: g91ov3ym70j0rc
BTW, this happens when I run kohya-scripts, but the exact same script and config work on non-faulty nodes.
Unknown User, 11mo ago
Message Not Public
yhlong00000, 11mo ago
All three pods landed on the same machine, and I’ve delisted that machine to avoid further issues. I’ll DM you with more details.
BlackWhiteAsian (OP), 11mo ago
@yhlong00000 Hey there, just got two more faulty instances: 6kxad780u6bda9, oweexcwlv8y62k. Same error. H100 NVL
Also: hq57ofbzb1xmhb, cz6iu4pzb8z8h4
Unknown User, 11mo ago
Message Not Public
BlackWhiteAsian (OP), 11mo ago
Yep. Probably the same machine. It seems to start a bit slower than working ones.
yhlong00000, 11mo ago
What error message do you see in the container log? The server looks good from my end.
BlackWhiteAsian (OP), 11mo ago
Same as before: 'CUDA error: uncorrectable ECC error encountered' when I ran kohya scripts. The container itself launched fine.
yhlong00000, 11mo ago
All the pods you listed above ran on the same machine and are associated with the same card. I will reach out to the DC and check it. Ping me here if you see more of this error.
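
For anyone hitting the same error, here is a minimal sketch of a check you could run in a fresh pod before starting a long job. It assumes `nvidia-smi` is available in the container and that ECC is enabled on the card. The helper names `parse_ecc_report` and `check_gpus` are made up for this example and are not part of any platform tooling. Note that the volatile counter resets when the driver reloads, so a count of zero does not prove the card is healthy; the error may only show up once the GPU memory is under load.

```python
import subprocess

# Fields accepted by `nvidia-smi --query-gpu` (listed by `nvidia-smi --help-query-gpu`).
QUERY = "index,name,ecc.errors.uncorrected.volatile.total"


def parse_ecc_report(csv_text):
    """Return (index, name, count) for every GPU reporting uncorrected ECC errors.

    Expects the output of `nvidia-smi --query-gpu=... --format=csv,noheader`,
    one GPU per line.
    """
    faulty = []
    for line in csv_text.strip().splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) < 3:
            continue  # skip lines that do not match the expected layout
        index, count = parts[0], parts[-1]
        name = ",".join(parts[1:-1])
        # The count is reported as "[N/A]" when ECC is unsupported or disabled.
        if index.isdigit() and count.isdigit() and int(count) > 0:
            faulty.append((int(index), name, int(count)))
    return faulty


def check_gpus():
    """Run nvidia-smi and return the GPUs with uncorrected ECC errors."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_ecc_report(result.stdout)
```

If `check_gpus()` returns a non-empty list, it is probably quicker to terminate the pod and report its ID here than to retry the training script on the same card.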