RunPod (3mo ago)
Clion

Connectivity issue on 4090 pod

Hello RunPod, I've been unable to access a stopped 4090 pod for roughly 10-12 hours. The pod ID is 24kw7y5uu2yuil, in the IS datacenter. During this time, the attached notice about a network outage has been displayed for that pod, and launching the pod gets stuck at "Waiting for logs", as in the second attached image. This happens when trying to launch with any number of the pod's GPUs (0-8 inclusive). I don't need to use this pod's GPUs, but I do have some important data I need to transfer off it. I held off on posting because I assumed the network issue was transient, but since it's been happening since before I went to bed last night, I figured I would reach out to see if there's any way I can get the data off this pod. Thanks!
8 Replies
Madiator2011 (3mo ago)
I see that the machine is unlisted. @Clion is your data stored on the pod's volume or on network storage?
Clion (3mo ago)
It's stored on the pod's volume, not independent network storage
Madiator2011 (3mo ago)
Will see what can be done
Clion (3mo ago)
Tyvm 🙏 Also, if it's possible/easier for you guys, I would be happy to just take a credit for the compute time it took to produce the data I'm trying to access. It was like 8-10hrs on that pod I think, and just redoing it isn't really that much of a bother.
Any updates on this? Apologies, not trying to rush you, just trying to determine whether I should be spinning up a new pod and redoing the training I did yesterday or if there's actually a chance that this pod's storage can be accessed to pull the models off of it.
Satish (3mo ago)
@Clion We are unable to access the host. I have sent a message to the DC team and am waiting for their response.
Clion (3mo ago)
Is it possible to just say fk it and take a compute time credit? I've got waiting tasks reliant on the trained models on that pod, and I'd guess that the endgame here is going to be "all temporary storage on affected machines is lost", so if I'm gonna have to redo 8-10hrs of compute I'd pref to get started sooner rather than later. Naturally though, decision is up to you guys, I'm not trying to cause anyone a hard time 🙃
Alright guys, I redid the training to reproduce the models that were on this pod, so I no longer need what is/was on this storage. But could I get a credit for the compute time that I used to do so? I think this is fair to ask, since hosts making pods with active storage inaccessible without notice is not really something that users are told to expect as a possibility, and I was paying the disk fees to keep that pod's storage active specifically so I could retrieve data from it later.
Satish (3mo ago)
@Kadeeja
Kadeeja (3mo ago)
Hi @Clion, could you DM your RunPod email? I can help you out here.