Runpod•7mo ago
ferdinandjason

Problem with hanging pod

Hi team, I have an issue with a hanging pod. The GPU somehow crashed and now all processes are hanging. I tried restarting the pod, but it didn't work. I then tried stopping and starting it again, and now the pod won't come up at all. Please help me with this. The pod ID is whainytdwrgb7l
17 Replies
ferdinandjasonOP•7mo ago
Now it's getting this
[image attached, no description]
Unknown User•7mo ago
Message Not Public
ferdinandjasonOP•7mo ago
It can start now, but it has restarted / stopped multiple times, which is really disruptive. And now it's happening again (https://discord.com/channels/912829806415085598/1359897197201719327/1359898014323572817), and I'm still being charged. I've stopped the pod for now.
Unknown User•7mo ago
Message Not Public
ferdinandjasonOP•7mo ago
I'm training a machine learning model right now, so it's running some Python scripts.
Unknown User•7mo ago
Message Not Public
ferdinandjasonOP•7mo ago
No related error message, sadly. The training loop is somehow stuck and not going any further, and when I run the nvidia-smi command it hangs as well. It has been stuck for around 7 hours. VRAM usage is around 66%.
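A hung nvidia-smi like the one described above can be detected from a script instead of a blocked terminal. The following is a minimal sketch, not something posted in the thread: the function name `gpu_responds` and the 10-second timeout are illustrative assumptions, and it only checks whether the command returns in time, not why it hangs.

```python
import subprocess


def gpu_responds(cmd=("nvidia-smi",), timeout_s=10.0):
    """Return True if `cmd` exits with status 0 within `timeout_s` seconds."""
    try:
        proc = subprocess.Popen(
            list(cmd),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except FileNotFoundError:
        return False  # command is not installed on this machine
    try:
        return proc.wait(timeout=timeout_s) == 0
    except subprocess.TimeoutExpired:
        # Best-effort kill. We deliberately do not wait afterwards: a process
        # stuck inside the GPU driver may not exit even after SIGKILL.
        proc.kill()
        return False
```

Calling this periodically from a separate process could give an early signal that the driver has stopped responding, without blocking the shell the way a direct nvidia-smi call does.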
Unknown User•7mo ago
Message Not Public
ferdinandjasonOP•7mo ago
From the statistics, RAM usage is only 2% at most.
[image attached, no description]
ferdinandjasonOP•7mo ago
"I'd suggest moving and trying in another pod"
I still need the data / the training checkpoint. How do I do that?
"But surely you can report this in a support ticket, have you created one?"
Yes, I created one here: https://contact.runpod.io/hc/en-us/requests/15833
Unknown User•7mo ago
Message Not Public
ferdinandjasonOP•7mo ago
Yes, I don't have any backup right now 😦 I tried to start the pod again ~10 minutes ago, but it still hasn't come up.
Unknown User•7mo ago
Message Not Public
ferdinandjasonOP•7mo ago
no logs
Unknown User•7mo ago
Message Not Public
riverfog7•7mo ago
Maybe look at the wandb logs. There should also be a copy of your code in wandb.
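If the run was logged to wandb, the console log and saved code can usually be pulled back through the wandb public API. This is a rough sketch under assumptions, not a confirmed recovery path for this pod: the helper names, the `wandb_recovered` folder, and the file names `output.log` and `code/` are assumptions about how the run was configured, and the model checkpoint itself would only be there if it had been uploaded.

```python
from pathlib import Path

# Assumed file names: wandb usually stores console output as "output.log"
# and, when code saving is enabled, scripts under "code/".
WANTED_PREFIXES = ("output.log", "code/")


def is_wanted(name, prefixes=WANTED_PREFIXES):
    """Return True for run files worth pulling first (logs and saved code)."""
    return name.startswith(tuple(prefixes))


def download_run_files(run_path, dest="wandb_recovered"):
    """Download logs and saved code for one run, e.g. 'entity/project/run_id'."""
    import wandb  # imported lazily so is_wanted() works without wandb installed

    run = wandb.Api().run(run_path)
    Path(dest).mkdir(parents=True, exist_ok=True)
    saved = []
    for f in run.files():
        if is_wanted(f.name):
            f.download(root=dest, replace=True)
            saved.append(f.name)
    return saved
```

Running `download_run_files("entity/project/run_id")` with the real run path (shown in the run's URL) from any machine with wandb credentials would not need the hanging pod at all.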
