Runpod · 2mo ago
lilshake

Data Loss Due to Critical Hardware Failure

My pod was one of those affected by an IS-3 H100 hardware failure, and I received the email saying my data is not recoverable. This was a critical, extremely expensive training run, and the automated support email is telling me to wait 1–2 business days, which is absolutely not workable for a hardware-fault incident on your side.

Ticket: #27104
Issue: Full data loss + unclear pod state + no guidance on whether I need to restart the entire setup.

I need a human from the support or engineering team to look at this now. At minimum, I need to know:

Is the node/pod coming back online?

Is the training job completely lost?

Do I need to rebuild everything from scratch?

This is blocking my work and I cannot wait multiple days for a first response.