Problem with hanging pod
Hi team, I have an issue with a hanging pod. Somehow the GPU crashed and now all the processes are hanging.
I tried restarting the pod, but it didn't work. I then tried stopping and starting it again, and now the pod won't come up at all.
Please help me with this. This is the pod ID:
whainytdwrgb7l
https://contact.runpod.io/hc/en-us/requests/15833
Raised it there as well
Now it's getting this

oh maybe there's some problem with the pod then
it can come up now, but it has restarted / stopped multiple times
it's really disturbing right now
and now https://discord.com/channels/912829806415085598/1359897197201719327/1359898014323572817 it's happening again
and I'm still charged
now stopped the pod
hmm
what did you run
I'm training a machine learning model right now, so some Python scripts
Any error message related to the "gpu crash"?
Or what does that mean?
Might be related to too much RAM being used (full), you might need a pod with more RAM
no related error message, sadly
the training loop is somehow stuck and not going any further,
and when I run the
nvidia-smi
command, it just hangs; it has been like this for around 7 hours
the VRAM usage is around 66%
I see
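For anyone hitting the same symptom, here is a minimal sketch for checking whether `nvidia-smi` responds at all without blocking your terminal. It assumes Python 3 is available inside the pod; the helper name `run_with_timeout` and the 10-second limit are made up for illustration. It deliberately does not wait for the child after killing it, since a process stuck inside the GPU driver may not react to the kill.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s=10.0):
    """Run cmd and return (status, output).

    status is one of 'ok', 'failed', 'timeout' or 'missing'.
    """
    try:
        proc = subprocess.Popen(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            text=True,
        )
    except FileNotFoundError:
        return "missing", ""
    try:
        output, _ = proc.communicate(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # Best effort only: a process blocked in the driver may ignore the
        # kill, so we do not wait for it to exit.
        proc.kill()
        return "timeout", ""
    return ("ok" if proc.returncode == 0 else "failed"), output


if __name__ == "__main__":
    status, output = run_with_timeout(["nvidia-smi"], timeout_s=10.0)
    print(f"nvidia-smi status: {status}", file=sys.stderr)
    print(output)
```

A `timeout` result here would match what is described above (the command never returns), which is useful to mention in the support ticket.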
The ram?
What about the ram
I'd suggest moving and trying in another pod
But surely you can report this in a support ticket, have you created one?
from the statistics, the RAM usage is only 2% at most
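If you want to double-check the RAM number from inside the pod rather than from the dashboard, a rough sketch is below. It assumes the pod is a Linux container with cgroup memory accounting mounted at the usual paths (an assumption, not something confirmed in this thread); the function names are made up for illustration.

```python
from pathlib import Path

# (usage file, limit file) pairs for cgroup v2 and cgroup v1.
# These are the standard Linux locations; whether they are visible
# inside a given pod is an assumption.
CGROUP_FILES = [
    ("/sys/fs/cgroup/memory.current", "/sys/fs/cgroup/memory.max"),
    (
        "/sys/fs/cgroup/memory/memory.usage_in_bytes",
        "/sys/fs/cgroup/memory/memory.limit_in_bytes",
    ),
]


def read_memory_fraction(usage_path, limit_path):
    """Return usage / limit as a float, or None if unreadable or unlimited."""
    try:
        usage = Path(usage_path).read_text().strip()
        limit = Path(limit_path).read_text().strip()
    except OSError:
        return None
    # cgroup v2 writes the literal string "max" when there is no limit.
    if not usage.isdigit() or not limit.isdigit():
        return None
    limit_bytes = int(limit)
    if limit_bytes == 0:
        return None
    return int(usage) / limit_bytes


def container_memory_fraction():
    """Try each known cgroup layout and return the first usable fraction."""
    for usage_path, limit_path in CGROUP_FILES:
        fraction = read_memory_fraction(usage_path, limit_path)
        if fraction is not None:
            return fraction
    return None


if __name__ == "__main__":
    fraction = container_memory_fraction()
    if fraction is None:
        print("could not read cgroup memory files")
    else:
        print(f"container RAM usage: {fraction:.1%}")
```

Note that on cgroup v1 an unlimited container reports a very large limit, so the fraction will simply look tiny in that case.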

"I'd suggest moving and trying in another pod"
I still need the data / the training checkpoint, how do I do that?
"But surely you can report this to support ticket, have you created one?"
Created one here: https://contact.runpod.io/hc/en-us/requests/15833
Okayy
Ohh, you don't have any backup now?
Can you access the terminal and run commands then?
If yes, you can install any file transfer tool that works over the internet
Like rsync, rclone, syncthing
Then move the data to the other pod
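As a rough sketch of the rsync route, assuming the old pod's terminal still works, rsync is installed on both pods, and the new pod accepts SSH: the helper below only builds the command so you can inspect it before running. The paths, user, host and port are placeholders, not real values from this thread.

```python
import shlex
import subprocess


def build_rsync_cmd(src_dir, dest_user, dest_host, dest_port, dest_dir):
    """Build an rsync-over-SSH command that copies src_dir into dest_dir."""
    return [
        "rsync",
        "-avz",        # archive mode, verbose, compress in transit
        "--partial",   # keep partially transferred files so a retry can resume
        "--progress",  # show per-file progress
        "-e", f"ssh -p {int(dest_port)}",
        # No trailing slash: the directory itself is created under dest_dir.
        src_dir.rstrip("/"),
        f"{dest_user}@{dest_host}:{dest_dir}",
    ]


if __name__ == "__main__":
    # Hypothetical values; replace them with your own pod's details.
    cmd = build_rsync_cmd(
        src_dir="/workspace/checkpoints",
        dest_user="root",
        dest_host="203.0.113.10",
        dest_port=22022,
        dest_dir="/workspace/",
    )
    print(shlex.join(cmd))
    # Uncomment to actually start the transfer:
    # subprocess.run(cmd, check=True)
```

Printing the command first makes it easy to confirm the source and destination before any data moves.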
Try restarting the pod if your data is in your volume (the mounted path)
yes, I don't have any backup right now 😦
tried to start the pod again ~10 minutes ago, but it still hasn't come up
What do the logs show right now?
no logs
Well what template are you using
Maybe the pod is down, just wait for support then
Maybe look at wandb logs
And there should be a copy of your code in wandb