Data Loss Due to Critical Hardware Failure

My pod was one of those affected by an IS-3 H100 hardware failure, and I received the email saying my data is not recoverable. This was a critical, extremely expensive training run, and the automated support email tells me to wait 1–2 business days, which is absolutely not workable for a hardware-fault incident on your side.

Ticket: #27104
Issue: Full data loss + unclear pod state + no guidance on whether I need to restart the entire setup.

I need a human from the support or engineering team to look at this now. At minimum, I need to know:

Is the node/pod coming back online?

Is the training job completely lost?

Do I need to rebuild everything from scratch?

This is blocking my work and I cannot wait multiple days for a first response.

Runpod•5mo ago•


lilshake
