Faulty node?
Since this morning, I've encountered this error multiple times: 'CUDA error: uncorrectable ECC error encountered'.
Every time, after terminating the pod and starting a new one, the problem went away.
All incidents were on US-GA-2, H100-PCIe.
Hey, could you share the pod IDs here? We can take a look.
I didn't note them down, but I just encountered another one: 3sc3qsn1qhu0mz
Another one. Two in a row: 3g93y1byjkjq1o
Can I get a refund for the time I wasted on these? I've had more than 10 of these in the past 2 days.
Another one: g91ov3ym70j0rc
BTW, this happens when I run kohya-scripts, but the exact same script and config work on non-faulty nodes.
All three pods landed on the same machine, and I’ve delisted that machine to avoid further issues. I’ll DM you with more details.
@yhlong00000 Hey there, just got two more faulty instances: 6kxad780u6bda9, oweexcwlv8y62k. Same error. H100 NVL
Also: hq57ofbzb1xmhb, cz6iu4pzb8z8h4
Yep.
Probably the same machine. It seems to start a bit slower than working ones.
What error message do you see in the container log? The server looks good from my end.
Same as before: 'CUDA error: uncorrectable ECC error encountered' when I ran kohya scripts. The container itself launched fine.
All the pods you listed above were running on the same machine and are associated with the same card. I will reach out to the DC and check it. Ping me here if you see more of this error.
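For anyone hitting the same error: a rough sketch of how one might check a pod's GPUs for uncorrected ECC errors before starting a long run. This is not what the support team used; it only assumes that `nvidia-smi` is available in the container and that the driver reports the `ecc.errors.uncorrected.volatile.total` query field (listed by `nvidia-smi --help-query-gpu`, though availability may vary by GPU and driver). The function names are made up for illustration.

```python
import csv
import io
import subprocess

# Query fields as listed by `nvidia-smi --help-query-gpu`.
# The "volatile" counter resets on driver reload or reboot.
QUERY_FIELDS = "index,name,ecc.errors.uncorrected.volatile.total"


def parse_ecc_report(csv_text):
    """Parse `--format=csv,noheader` output into (index, name, count) tuples.

    count is None when the value is not a plain integer,
    e.g. "[N/A]" on GPUs without ECC support.
    """
    rows = []
    for record in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if len(record) != 3:
            continue  # skip blank or malformed lines
        index, name, raw = (field.strip() for field in record)
        count = int(raw) if raw.isdigit() else None
        rows.append((int(index), name, count))
    return rows


def gpus_with_uncorrected_errors(rows):
    """Return only the GPUs whose uncorrected ECC counter is above zero."""
    return [row for row in rows if row[2]]


def query_local_gpus():
    """Run nvidia-smi on this machine and parse its report (needs a GPU host)."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_ecc_report(result.stdout)
```

On a pod, `gpus_with_uncorrected_errors(query_local_gpus())` returning a non-empty list would be a reason to terminate the pod and report its ID. A zero counter does not prove the card is healthy, since the error in this thread only showed up once the training script was running.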