Jupyter notebook (in Chrome tab) consistently crashing after 20 hours

My JupyterLab notebook Chrome tab has crashed 22 hours into training a model. How do I know whether it is still training, has stopped, or is just running without doing anything? This has happened to me 3 times in a row, and this time I would like to know what is happening. The GPU usage is going up and down, which suggests it is training and simply not showing in the notebook, but I would like to make sure.
13 Replies
Zeke John (5mo ago)
any update? 48 hours running and still nothing.
justin (5mo ago)
@Justin / @Madiator2011 - Just tagging staff who may be able to give your pod a look. My guess is that in the future you can click Logs, and you can always use Connect > Web Terminal or SSH directly into your pod to check on the job. Hard to know why your Chrome tab is crashing, though.
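Once you are in the pod through the web terminal or SSH, one way to tell "still training" from "stopped" from "running but idle" is to check that the process exists and that its log or checkpoint file is still being written. A minimal Python sketch, assuming a Linux pod; the function names, the log path, and the 10-minute staleness threshold are illustrative assumptions, not anything RunPod or Jupyter provides:

```python
import os
import time


def is_process_alive(pid: int) -> bool:
    """Return True if a process with this PID exists (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0 only checks for existence, nothing is sent
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def seconds_since_modified(path: str) -> float:
    """Seconds elapsed since the file at `path` was last written."""
    return time.time() - os.path.getmtime(path)


def training_status(pid: int, log_path: str, stale_after: float = 600.0) -> str:
    """Classify a job as 'stopped', 'alive but idle', or 'running'.

    `stale_after` is a hypothetical threshold: if the log has not been
    touched for that many seconds, the job is treated as idle.
    """
    if not is_process_alive(pid):
        return "stopped"
    if seconds_since_modified(log_path) > stale_after:
        return "alive but idle"
    return "running"
```

The PID can be found with `ps aux | grep python` in the terminal. This only works if the training script writes a log or checkpoints regularly; pick a threshold longer than the gap between writes.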
Madiator2011 (5mo ago)
Try running the command from the terminal using screen or tmux.
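For reference, the tmux suggestion could look roughly like this. It is a sketch in Python that only builds and runs standard tmux commands (`new-session -d -s`, `capture-pane -p -t`); the session name `train`, the script `train.py`, and the log file `train.log` are placeholder assumptions:

```python
import shlex
import subprocess
from typing import List


def build_shell_command(script: str, log: str) -> str:
    """Shell line that runs the script and copies all output into a log file."""
    return f"python {shlex.quote(script)} 2>&1 | tee {shlex.quote(log)}"


def tmux_start_cmd(session: str, shell_command: str) -> List[str]:
    """tmux command that starts a detached (-d) session named `session`."""
    return ["tmux", "new-session", "-d", "-s", session, shell_command]


def tmux_peek_cmd(session: str) -> List[str]:
    """tmux command that prints the visible contents of the session's pane."""
    return ["tmux", "capture-pane", "-p", "-t", session]


def start_training(session: str = "train",
                   script: str = "train.py",
                   log: str = "train.log") -> None:
    """Launch the training script inside tmux; requires tmux to be installed."""
    cmd = tmux_start_cmd(session, build_shell_command(script, log))
    subprocess.run(cmd, check=True)
```

After `start_training()`, the job keeps running when the browser tab or SSH connection closes. `tmux attach -t train` in any terminal on the pod brings the live output back, and Ctrl-b followed by d detaches again.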
justin (5mo ago)
I would say, though, that since your GPU utilization is there (especially since it seemed to go up?), it's probably still working.
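To check GPU utilization from the terminal rather than the dashboard, something like the following could be used. It is a sketch that assumes `nvidia-smi` is on the pod's PATH and that every queried field is numeric (a field reported as `[N/A]` would make the parse fail); the function names are illustrative:

```python
import subprocess
from typing import List, Tuple

# Query utilization (%) and used memory (MiB) as bare CSV, one line per GPU.
QUERY = [
    "nvidia-smi",
    "--query-gpu=utilization.gpu,memory.used",
    "--format=csv,noheader,nounits",
]


def parse_gpu_stats(csv_text: str) -> List[Tuple[int, int]]:
    """Turn lines like '87, 10240' into (utilization_percent, memory_mib) pairs."""
    stats = []
    for line in csv_text.strip().splitlines():
        util, mem = (field.strip() for field in line.split(","))
        stats.append((int(util), int(mem)))
    return stats


def read_gpu_stats() -> List[Tuple[int, int]]:
    """Run nvidia-smi once and return one pair per GPU."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_gpu_stats(out)
```

Calling `read_gpu_stats()` a few times, some seconds apart, shows whether utilization keeps fluctuating. Sustained non-zero utilization is a good hint that training is progressing, though it does not prove it.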
Madiator2011 (5mo ago)
Jupyter is not advised for running long jobs.
Madiator2011 (5mo ago)
NetworkChuck on YouTube: "you need to learn tmux RIGHT NOW!!"
Madiator2011 (5mo ago)
I also made a template for an alternative notebook system: https://runpod.io/gsc?template=9ehepsqiw2&ref=vfker49t
It has some cool things but also cons.
Pros:
- Background jobs: you can close the website and the job will keep running, output included
- If you switch to the next-gen Zeppelin you get a much more modern UI
Cons:
- Jupyter notebooks are not directly compatible
- No file upload via drag and drop (maybe I should move this to pros)
justin (5mo ago)
Lol, I think having no drag-and-drop upload is a pro. It forces people to use runpodctl, or direct SSH to scp a zip over. (I think I remember flash saying that runpodctl still goes through a middle server, so direct SSH seems to always be the best?)
Madiator2011 (5mo ago)
Uploading via drag and drop:
- slow upload speed
- easy to corrupt files
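Whichever transfer method is used, corruption can be ruled out by comparing a checksum computed on both ends. A minimal sketch using Python's standard `hashlib`; the function name and the 1 MiB chunk size are arbitrary choices:

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hex SHA-256 digest of a file, read in chunks so large models fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps reading until f.read() returns b"" at EOF
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Run it on the local copy and on the pod's copy; if the two hex strings match, the file arrived intact.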
justin (5mo ago)
do u happen to have a repo for this? just curious.
Madiator2011 (5mo ago)
It's the RunPod PyTorch template with Apache Zeppelin (https://zeppelin.apache.org/) installed.
Zeke John (5mo ago)
@justin @Madiator2011 Thank you so much for all the advice and tips. In the future I will definitely be using tmux to train models and for any long-running jobs. The model finally finished training and I'm currently downloading it to see if it works ❤️
Madiator2011 (5mo ago)
For training, always use the console. I know how many times training has failed because Colab or Jupyter stopped working.