Problem with hanging pod

Hi team, I have an issue with a hanging pod. Somehow the GPU crashed and now every process is hanging. I tried restarting the pod, but that didn't work. I then tried stopping and starting it again, and now the pod won't come up at all. Please help me with this. The pod ID is whainytdwrgb7l.
17 Replies
ferdinandjason (OP) • 3w ago
Now I'm getting this:
[image attachment, no description]
Jason • 3w ago
Oh, maybe there's some problem with the pod, then.
ferdinandjason (OP) • 3w ago
It comes up now, but it has restarted / stopped multiple times, which is really disruptive. And now it's happening again (https://discord.com/channels/912829806415085598/1359897197201719327/1359898014323572817), and I'm still being charged. I've stopped the pod for now.
Jason • 3w ago
Hmm, what did you run?
ferdinandjason (OP) • 3w ago
I'm training a machine learning model right now, so some Python scripts.
Jason • 3w ago
Is there any error message related to the "GPU crash"? Or what does that mean? It might be related to too much RAM being used (full); you might need a pod with more RAM.
ferdinandjason (OP) • 3w ago
No related error message, sadly. The training loop is somehow stuck and not going any further. When I run the nvidia-smi command, it's still hanging after around 7 hours. VRAM usage is around 66%.
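For anyone debugging a similar hang, here is a minimal Python sketch of one way to check whether nvidia-smi itself responds within a time limit, rather than waiting on it indefinitely. The helper name `run_with_timeout` and the 10-second limit are assumptions for illustration, not anything from this thread. A timeout here may suggest the driver or GPU is stuck rather than just busy, but it is not proof. Also note that if the process is stuck in an uninterruptible state, even the kill triggered by the timeout may not return promptly.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s=10.0):
    """Run cmd and return (status, output).

    status is one of 'ok', 'error', 'timeout' or 'missing'.
    """
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # The command did not finish in time; for nvidia-smi this may
        # point at a stuck driver or GPU rather than a busy one.
        return "timeout", ""
    except FileNotFoundError:
        # The executable is not installed or not on PATH.
        return "missing", ""
    status = "ok" if result.returncode == 0 else "error"
    return status, (result.stdout or result.stderr).strip()


if __name__ == "__main__":
    # Ask only for used/total GPU memory (in MiB) as plain CSV.
    query = [
        "nvidia-smi",
        "--query-gpu=memory.used,memory.total",
        "--format=csv,noheader,nounits",
    ]
    status, output = run_with_timeout(query, timeout_s=10.0)
    # Send anything other than a clean result to stderr.
    stream = sys.stdout if status == "ok" else sys.stderr
    print(status, output, file=stream)
```

Running a check like this from a second terminal (or from a small monitoring script) gives a quick yes/no on whether the GPU tooling is responsive, which is useful detail to include in a support ticket.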
Jason • 3w ago
I see. The RAM? What about the RAM? I'd suggest moving and trying in another pod. But you can certainly report this in a support ticket. Have you created one?
ferdinandjason (OP) • 3w ago
According to the stats, RAM usage is only 2% at most.
[image attachment, no description]
ferdinandjason (OP) • 3w ago
"I'd suggest moving and trying in another pod"
I still need the data and the training checkpoint. How do I do that?
"But surely you can report this to support ticket, have you created one?"
I created one here: https://contact.runpod.io/hc/en-us/requests/15833
Jason • 3w ago
Okay. Oh, you don't have any backup right now? Can you access the terminal and run commands? If yes, you can install any file transfer tool that works over the internet, like rsync, rclone, or Syncthing, and then move the data to the other pod. Try restarting the pod if your data is in your volume (the mounted path).
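As a rough illustration of the rsync option, here is a minimal Python sketch that builds an rsync-over-SSH command for copying a checkpoint directory to another pod. Everything specific in it is an assumption for the example: the helper name `build_rsync_command`, the `/workspace/checkpoints` path, the `root` user, the `203.0.113.10` address, and port `22022` are placeholders, so substitute the values of your own pods. It also assumes rsync is installed on both ends and the destination pod accepts SSH connections.

```python
import shlex


def build_rsync_command(src_dir, user, host, port, dest_dir):
    """Build an rsync-over-SSH command as an argument list.

    Copies the contents of src_dir into dest_dir on the remote host.
    """
    return [
        "rsync",
        "-avz",        # archive mode, verbose, compress during transfer
        "--partial",   # keep partially transferred files so a retry can resume
        "--progress",  # show per-file transfer progress
        "-e", f"ssh -p {port}",      # use SSH on the given port
        src_dir.rstrip("/") + "/",   # trailing slash: copy the directory's contents
        f"{user}@{host}:{dest_dir}",
    ]


if __name__ == "__main__":
    # Hypothetical values; replace them with your own pod details.
    cmd = build_rsync_command(
        "/workspace/checkpoints", "root", "203.0.113.10", 22022,
        "/workspace/checkpoints",
    )
    # Print the command for review instead of running it directly.
    print(shlex.join(cmd))
```

Printing the command first makes it easy to check the paths before anything is copied; once it looks right, it can be pasted into the pod's terminal or passed to `subprocess.run(cmd, check=True)`.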
ferdinandjason (OP) • 3w ago
Yes, I don't have any backup right now 😦 I tried to start the pod again ~10 minutes ago, but it still hasn't come up.
Jason • 3w ago
What do the logs show right now?
ferdinandjason (OP) • 3w ago
no logs
Jason • 3w ago
Well, what template are you using? Maybe the pod is down; just wait for support, then.
riverfog7 • 3w ago
Maybe look at the wandb logs. There should also be a copy of your code in wandb.
