Runpod · 2y ago
jherrm

"We have detected a critical error on this machine which may affect some pods." Can't backup data

During a training run on 8xH100, I started seeing strange "Directory not found" errors in my Jupyter notebook. They could not be dismissed and kept popping up. My training run continued and completed, but I wasn't able to copy the data off the volume disk because the error modals blocked the notebook interface.

I looked at the deployment and saw this error: "We have detected a critical error on this machine which may affect some pods. We are looking into the root cause and apologize for any inconvenience. We would recommend backing up your data and creating a new pod in the meantime."

Unfortunately, nothing I've tried to retrieve my data works: reconnecting to the notebook, the Web Terminal, SSH (both options), and even stopping and starting the pod all fail.

When I try to start the pod again, it stalls on "create pod network".

How do I get my data!?