Problem with hanging pod
Hi team, I have an issue with a hanging pod. The GPU somehow crashed, and now all the processes are hanging.
I tried restarting the pod, but that didn't work. Then I tried stopping and starting it again, and now I can't get the pod up at all.
Please help me with this. This is the pod ID:
whainytdwrgb7l
https://contact.runpod.io/hc/en-us/requests/15833
I've raised it there as well.
Now I'm getting this:

It's up now, but it has restarted / stopped multiple times.
This is really disruptive right now.
And now it's happening again: https://discord.com/channels/912829806415085598/1359897197201719327/1359898014323572817
And I'm still being charged.
I've stopped the pod now.
I'm training a machine learning model right now, so it's running some Python scripts.
Sadly, there's no related error message.
The training loop is somehow stuck and not going any further,
and when I run the nvidia-smi command, it hangs as well. It has been stuck like this for around 7 hours.
VRAM usage is around 66%.
According to the stats, RAM usage is only 2% at most.
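Since nvidia-smi itself hangs here, one way to check the GPU without blocking your shell is to run the command under a timeout. This is only a minimal sketch: the helper name `probe_command` and the 10-second default are my own choices for illustration, not anything from RunPod or NVIDIA.

```python
import subprocess


def probe_command(cmd, timeout_s=10.0):
    """Run cmd and classify the result as 'ok', 'failed', 'missing', or 'hung'."""
    try:
        proc = subprocess.Popen(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
    except FileNotFoundError:
        # The executable is not on PATH (e.g. no NVIDIA tools in the image).
        return "missing"
    try:
        returncode = proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # Best effort only: a process stuck inside the GPU driver may ignore
        # the kill signal, so we deliberately do not wait for it to exit.
        proc.kill()
        return "hung"
    return "ok" if returncode == 0 else "failed"
```

Calling `probe_command(["nvidia-smi"])` from the pod's terminal would return "hung" if the command does not finish within the timeout, which is a useful detail to include in the support ticket.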

I'd suggest moving to another pod and trying there.
I still need the data / the training checkpoint. How do I do that?
But you can certainly report this in a support ticket. Have you created one?
Created one here: https://contact.runpod.io/hc/en-us/requests/15833
yes, I don't have any backup right now 😦
I tried to start the pod again ~10 minutes ago, but it still hasn't come up.
no logs
Maybe look at the wandb logs.
There should also be a copy of your code in wandb.
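If the run did upload files to wandb, something like the sketch below could pull them back down from any machine. It uses the wandb public API (`wandb.Api().run(...)`, `run.files()`, `file.download(...)`). The helper names, the suffix list, and the `wandb_recovered` folder are assumptions for illustration. It only recovers files that the training script actually uploaded to the run, which the thread doesn't confirm.

```python
from pathlib import Path

# Assumption: checkpoints use one of these suffixes; adjust to your script.
CHECKPOINT_SUFFIXES = (".pt", ".pth", ".ckpt", ".safetensors")


def select_files(names, suffixes=CHECKPOINT_SUFFIXES):
    """Return the file names that end with one of the given suffixes."""
    return [n for n in names if n.lower().endswith(tuple(suffixes))]


def download_run_files(run_path, dest="wandb_recovered",
                       suffixes=CHECKPOINT_SUFFIXES):
    """Download matching files from a run given as 'entity/project/run_id'."""
    import wandb  # lazy import so select_files works without wandb installed

    run = wandb.Api().run(run_path)
    files = {f.name: f for f in run.files()}
    wanted = select_files(list(files), suffixes)
    Path(dest).mkdir(parents=True, exist_ok=True)
    for name in wanted:
        # replace=True overwrites any local copy with the same name.
        files[name].download(root=dest, replace=True)
    return wanted
```

To fetch the saved code instead of checkpoints, the same function can be called with `suffixes=(".py",)`. The run path is shown on the run's overview page in the wandb UI.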