Runpod, 11mo ago
Brawl

RunPod disconnecting/resetting during model training

Hi everyone, I've encountered an issue several times over the past week and have yet to successfully complete a model because of it. I've triple-checked to ensure I'm using an On-Demand instance. However, after a few hours of running my model, the web server or Jupyter notebook loses its connection. When I reconnect, the session appears to have reset:

• If I use the web server, when I reconnect, the terminal is blank.
• If I use Jupyter Notebook, the kernel is idle.

Despite this, I can see from the pod information that something is still running, and the GPU usage indicates activity. However, I'm unable to access or resume whatever process is ongoing.

As far as I know, I should be able to disconnect my internet, shut down my machine, and later log back in to find the model either completed or still running. This behavior suggests the interruption is happening on the server side rather than my end (I have funds in my account).

Does anyone know why this might be happening or how to resolve it? Thanks in advance!
7 Replies
Unknown User, 11mo ago
Message Not Public
Brawl (OP), 11mo ago
Ah, I have connected through SSH and it's currently running. Is that OK?
Unknown User, 11mo ago
Message Not Public
Brawl (OP), 11mo ago
Ah, OK, I have not heard of tmux. I will check it out now 🙂
Unknown User, 11mo ago
Message Not Public
Brawl (OP), 11mo ago
Ah, thank you so much nerdylive 🙂
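
Editor's note: the hidden replies apparently pointed to tmux, which keeps a shell session alive on the pod after the browser tab or SSH connection drops. For anyone who would rather start the job from Python, here is a minimal sketch of the same idea, assuming a Linux pod: run the training command in its own session and send its output to a log file, so the job does not depend on the web terminal or notebook connection. The names `launch_detached`, `tail`, `train.py` and `train.log` are made up for illustration and are not from this thread or from RunPod.

```python
import subprocess
from pathlib import Path


def launch_detached(command, log_path):
    """Start `command` in its own session, appending its output to `log_path`.

    Returns the Popen handle; `handle.pid` can be noted down to check on the
    job later (for example with `ps`).
    """
    log_file = open(log_path, "ab")
    process = subprocess.Popen(
        command,
        stdin=subprocess.DEVNULL,   # no dependence on the terminal's input
        stdout=log_file,            # stdout goes to the log file
        stderr=subprocess.STDOUT,   # stderr goes to the same file
        start_new_session=True,     # POSIX only: detach from the controlling terminal
    )
    log_file.close()  # the child keeps its own copy of the file descriptor
    return process


def tail(log_path, lines=20):
    """Return the last `lines` lines of the log, for checking progress after reconnecting."""
    text = Path(log_path).read_text(errors="replace")
    return text.splitlines()[-lines:]
```

Something like `launch_detached(["python", "train.py"], "train.log")` would start the job, and `tail("train.log")` shows recent output after reconnecting. This only protects against dropped connections; it will not survive the pod itself being stopped or restarted, so saving checkpoints is still worthwhile.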
Unknown User, 11mo ago
Message Not Public
