RunPod · 6mo ago
Vitali

Services Stopped

Hi team, could somebody help me with this issue? My pod is running runpod/pytorch:2.1.1-py3.10-cuda12.1.1-devel-ubuntu with 4 RTX 4090. To start my AI training program, I enter commands on the command line and the process starts. But after 1-4 hours the process stops, and I have to retype all the commands to start it again. What could be stopping the process? Why do I need to restart everything 3-4 times per day?
8 Replies
Justin Merrell · 6mo ago
Is it stopping or is it the ssh connection that is resetting?
Vitali · 6mo ago
No, the service is stopping. The connection is not resetting. Actually, the issue is this: once connected to the server via SSH, I run commands to start the AI services, and they start and run well. But when I close my laptop or my terminal, the SSH connection drops, which seems OK, but the AI services stop too.
ashleyk · 6mo ago
You can use screen or tmux for this.
Vitali · 6mo ago
What is it? How can I resolve the issue?
ashleyk · 6mo ago
SSH connections can't stay open if you close your laptop, and the training can't continue if you close the terminal it was started from. Screen and tmux start a session that keeps running in the background, and you can resume it later if you need to close your laptop or terminal. I would highly recommend using one of them for training in any case. Screen is typically easier to use than tmux.
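For anyone finding this thread later, here is a minimal sketch of what that could look like. The session name `train`, the script `train.py`, and the log file `train.log` are placeholders I made up, so substitute your own commands. It assumes tmux or screen is already installed on the pod.

```shell
# Sketch only: session name, script name, and log file are placeholders.
SESSION="train"
TRAIN_CMD="python train.py 2>&1 | tee train.log"

# Start the job in a detached tmux session, so it is not tied to the SSH login.
start_with_tmux() {
  tmux new-session -d -s "$SESSION" "$TRAIN_CMD"
}

# Same idea with GNU screen: -d -m starts the session detached, -S names it.
start_with_screen() {
  screen -dmS "$SESSION" bash -c "$TRAIN_CMD"
}

# After reconnecting over SSH, reattach to see the running job:
#   tmux attach -t train     (detach again with Ctrl-b, then d)
#   screen -r train          (detach again with Ctrl-a, then d)
# List existing sessions with: tmux ls   or   screen -ls
```

Call `start_with_tmux` or `start_with_screen` once, then you can close the laptop and the job should keep going. The session ends when the training command exits, which is why the sketch also writes the output to a log file. If neither tool is present, on Ubuntu-based images it can usually be installed with `apt-get update && apt-get install -y tmux`.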
Vitali · 6mo ago
Is Screen also a terminal?
ashleyk · 6mo ago
No, it's basically a session manager. You use it within the terminal.
Vitali · 6mo ago
OK, I'll read about this. Thanks for the advice. I'll try it :)