Data Loss Due to Critical Hardware Failure
My pod was one of those affected by an IS-3 H100 hardware failure, and I received the email saying my data is not recoverable. This was a critical, extremely expensive training run, and the automated support email tells me to wait 1–2 business days, which is absolutely not workable for a hardware-fault incident on your side.
Ticket: #27104
Issue: Full data loss + unclear pod state + no guidance on whether I need to restart the entire setup.
I need a human from the support or engineering team to look at this now. At minimum, I need to know:
Is the node/pod coming back online?
Is the training job completely lost?
Do I need to rebuild everything from scratch?
This is blocking my work and I cannot wait multiple days for a first response.