We have detected a critical error on this machine which may affect some pods.
Hey all. We're renting a number of H100s as a trial run of Runpod while we look for another compute provider. We paid for 24 hours of compute to transfer terabytes of data onto the machine, plus bandwidth and additional storage. We also paid our cloud provider's egress costs, which came to more than we paid for the H100 machine, and rented a disk- and network-optimized machine to transfer the data to the Runpod machine quickly.
After 24 hours, we are getting this error on the Runpod GUI:
We have detected a critical error on this machine which may affect some pods. We are looking into the root cause and apologize for any inconvenience. We would recommend backing up your data and creating a new pod in the meantime.
Running nvidia-smi gives an ERR! for the 3rd GPU.
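In case it helps with triage, here is a rough sketch of how the affected GPU could be flagged from a script instead of reading the table by eye. It assumes a failing GPU shows `ERR!` (or some text containing `Error`) in one of the queried fields, which we have not verified for the CSV output mode. The helper names `find_faulty_gpus` and `query_gpus` and the choice of queried fields are ours for illustration, not anything provided by Runpod.

```python
import subprocess

# nvidia-smi query; the selected fields are an arbitrary choice for this sketch.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,name,temperature.gpu",
    "--format=csv,noheader",
]


def find_faulty_gpus(csv_text):
    """Return the index field of each row where some other field looks like an error.

    Assumption: a broken GPU reports "ERR!" or a string containing "Error"
    in at least one queried field.
    """
    faulty = []
    for line in csv_text.splitlines():
        fields = [f.strip() for f in line.split(",")]
        if not fields[0]:
            continue  # skip blank lines
        if any("ERR!" in f or "Error" in f for f in fields[1:]):
            faulty.append(fields[0])
    return faulty


def query_gpus():
    """Run nvidia-smi and return its stdout as text (requires nvidia-smi on PATH)."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=False)
    return result.stdout


if __name__ == "__main__":
    print("GPUs reporting errors:", find_faulty_gpus(query_gpus()))
```

The parsing is kept separate from the subprocess call so it can be checked against saved output on a machine without GPUs.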
What are our options here? Is this an error that Runpod will fix, or have we paid for a faulty machine?
Also, is there any way to attach the persistent volume disk we are currently paying for to a different H100 machine, so we do not have to spend another 24 hours transferring data and paying additional fees? Please advise.



