Runpod4mo ago
Murex90

Pod Running

I’m having a problem running a program. I create the pod, connect, the desktop opens, I set the parameters I need and start the job. It begins to “work”, but when it reaches “Cache Latent” the desktop disconnects. I reload the page and the desktop appears again, but after a few seconds it disconnects again. I reload once more and it says it's impossible to connect (forcing me to reconnect from scratch and re-enter all the parameters). I’ve tried several different GPUs but nothing changes. I also tried using the “web terminal”, but when I paste the link into the browser, it says it can’t connect. I’ve already spent several dollars without managing to complete even 1% of the job. How can I fix this issue and actually get the job to run? Thank you.
51 Replies
Murex90
Murex90OP4mo ago
@Dj
Dj
Dj4mo ago
Hey, thanks for the ping :) Unfortunately I don't know a thing about Stable Diffusion. If there's some kind of verbose logging you can enable, that would be great. I don't know what is happening behind the scenes, and it could be a lot of things. If you'd like, you can respond with a log or contact support, who may know more about this - sorry. I can help you with a credit for what you've spent, especially if you're willing to grab logs!
Murex90
Murex90OP4mo ago
Thank you for your reply. As for the logs, I can only see the pod log, because when the desktop disconnects I can no longer see the job terminal (or any errors that might appear after “cache latent”). This is when using Linux. With the web terminal, it doesn’t even open, so I can’t even start the job. What do you mean by “you're willing to grab logs though!”?
Unknown User
Unknown User4mo ago
Message Not Public
Murex90
Murex90OP4mo ago
I understood you meant the details of the problem, but I wanted to know what you meant in relation to the credits I spent. As for what I'm trying to do specifically, I'll try again and take screenshots (if you tell me of what exactly, because the only thing I can see after disconnecting, as I said, is the pod log) and grab the terminal logs while it's still “working”. In the meantime, I'll write down what I want to do. I'm trying to train a model (using Kohya), but as I said, the desktop (not Kohya itself) always disconnects, and I have to reconnect to the pod every time (so any work done up until the disconnection is lost). I wanted to try the web terminal for greater stability, but when I copy the link into my browser, it says it's impossible to connect (so I have no feedback on this, and I'd also like to understand how to connect via the web terminal, in addition to understanding the previous problem).
Unknown User
Unknown User4mo ago
Message Not Public
Murex90
Murex90OP4mo ago
Oh, about the refund: so they decide whether to refund it based on the log. As I was saying, it's not Kohya that disconnects. The training itself starts, it loads the dataset, runs the initial parameters, then reaches “Text Encoder 2”. It stays there for a while, and then the desktop disconnects (while Kohya is still connected). I reload the page, the desktop reappears, but after a few seconds it disconnects again, and when I reload the page again it says it’s impossible to connect. Template: https://console.runpod.io/deploy?gpu=RTX%203090&count=1&template=9thomk0pjf
Murex90
Murex90OP4mo ago
Training Log:
Murex90
Murex90OP4mo ago
Pod Log:
Murex90
Murex90OP4mo ago
@Jason @Dj
Dj
Dj4mo ago
This error indicates your Pod is running out of normal RAM, not VRAM, whilst doing its job. The amount of RAM we give you scales with the type and number of GPUs you select in the UI.
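For anyone wanting to confirm the system-RAM (not VRAM) diagnosis from inside the pod, here is a minimal sketch that reads the container's memory usage and limit. It assumes the pod is a Linux container that exposes the standard cgroup v2 or v1 memory files; whether a given Runpod template exposes them is an assumption, and the function names are made up for illustration.

```python
from pathlib import Path
from typing import Optional, Tuple

# Standard cgroup v2 and v1 locations; which pair exists depends on the host.
CGROUP_FILES = [
    ("/sys/fs/cgroup/memory.current", "/sys/fs/cgroup/memory.max"),
    ("/sys/fs/cgroup/memory/memory.usage_in_bytes",
     "/sys/fs/cgroup/memory/memory.limit_in_bytes"),
]


def parse_bytes(raw: str) -> Optional[int]:
    """Return the value as an int, or None when cgroup v2 reports 'max' (no limit)."""
    raw = raw.strip()
    return None if raw == "max" else int(raw)


def usage_percent(current: int, limit: int) -> float:
    """Percentage of the memory limit currently in use."""
    return 100.0 * current / limit


def read_container_memory() -> Optional[Tuple[Optional[int], Optional[int]]]:
    """Return (current, limit) in bytes, or None if no cgroup files are found."""
    for current_path, limit_path in CGROUP_FILES:
        cur, lim = Path(current_path), Path(limit_path)
        if cur.exists() and lim.exists():
            return parse_bytes(cur.read_text()), parse_bytes(lim.read_text())
    return None


if __name__ == "__main__":
    reading = read_container_memory()
    if reading is None or reading[0] is None or reading[1] is None:
        print("No container memory limit found")
    else:
        current, limit = reading
        print(f"RAM: {current / 2**30:.1f} / {limit / 2**30:.1f} GiB "
              f"({usage_percent(current, limit):.0f}%)")
```

Running this in a second terminal every few seconds while the job approaches the stage where it crashes would show whether usage climbs toward the limit.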
Murex90
Murex90OP4mo ago
@Dj Ok, so how do you fix it? I don't want to keep throwing money away without even being able to get past that point, especially since I've already spent quite a bit. How can I keep it stable? Also, I need to run a fairly long task, and if it already crashes during the test after less than ten minutes, I really need it to stay stable. If the issue is the CPU, how can it be resolved? I’d also like to understand how to use the 'web terminal', which is supposed to be more stable.
Dj
Dj4mo ago
Usage of the web terminal won't really fix the issue here, you're running out of RAM. I don't know anything about your workload, so I can't make specific recommendations but I did manage to find your Runpod account based on what I know. The first Pod you created to trigger an out of memory error was sol8r7vaquhe73 which did indeed spike the RAM and CPU usage.
Dj
Dj4mo ago
What I can recommend is paying a lot of attention to the "Pod Summary" card on the Deploy Page. When selecting 1 3090, you can see we provide you 125GB of RAM.
Dj
Dj4mo ago
When you select two, that number doubles
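The scaling described here can be written as simple arithmetic. This is only a sketch: the 125 GB figure is the one quoted in this thread for a single RTX 3090, the thread confirms the doubling only for one and two GPUs, and the function name is hypothetical. The "Pod Summary" card remains the authoritative number.

```python
RAM_PER_3090_GB = 125  # figure quoted in this thread for one RTX 3090


def expected_ram_gb(gpu_count: int, ram_per_gpu_gb: int = RAM_PER_3090_GB) -> int:
    """Estimate system RAM for a pod, assuming RAM scales linearly with GPU count."""
    if gpu_count < 1:
        raise ValueError("gpu_count must be at least 1")
    return gpu_count * ram_per_gpu_gb


if __name__ == "__main__":
    for count in (1, 2):
        print(f"{count}x RTX 3090: about {expected_ram_gb(count)} GB RAM")
```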
Dj
Dj4mo ago
You can use the web terminal by toggling it on, pressing the "Connect" button on the console page and then enabling it when given the option.
Murex90
Murex90OP4mo ago
Yes, that's what I did, I launched the terminal, installed the components, it gave me the link, but when I pasted it into the web page, it said: 'Unable to connect' (I tried several times, but it never connected). So should I use two of them? And if it doesn’t exceed 250GB, it shouldn’t disconnect, right?
Dj
Dj4mo ago
Yes, but I don't know for certain what's even happening inside your container >.<
Murex90
Murex90OP4mo ago
Ok, I’ll give it a try, but if it doesn’t work, how can I show you what’s happening inside the container?
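One generic way to keep evidence of what happened inside the container is to write the job's output to a file as it runs, so the last lines are still on disk after the desktop disconnects. The sketch below is not a Runpod or Kohya feature; it is a plain Python wrapper, and the command and log path in the comment are placeholders. If the wrapper itself is killed along with the desktop session, the file still holds everything written up to that point.

```python
import subprocess
from typing import List


def run_and_log(command: List[str], log_path: str) -> int:
    """Run a command, appending its stdout and stderr to log_path line by line.

    Returns the command's exit code.
    """
    with open(log_path, "a", encoding="utf-8") as log:
        process = subprocess.Popen(
            command,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # merge errors into the same stream
            text=True,
        )
        for line in process.stdout:
            log.write(line)
            log.flush()  # push each line out so little is lost on a crash
        return process.wait()


# Example with placeholder values (both the script name and the path are
# hypothetical and must be replaced with the real training command and a
# location on the pod's persistent disk):
#   run_and_log(["bash", "train.sh"], "train.log")
```

The resulting log file can then be shared with support alongside the pod log.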
Unknown User
Unknown User4mo ago
Message Not Public
Murex90
Murex90OP3mo ago
For this test, a LoRA with a small dataset; for the main work, a model with a large dataset.
Unknown User
Unknown User3mo ago
Message Not Public
Murex90
Murex90OP3mo ago
@Jason I tried using two GPUs. After more than a dozen attempts with various errors, including “CUDA out of memory”, I finally managed to start the training, almost 3 hours after I began because of all the errors I had to fix (none of them showed up last time, when the training started immediately). But just like last time, it gets to “Text encoder 2” and then the desktop crashes.
Unknown User
Unknown User3mo ago
Message Not Public
Murex90
Murex90OP3mo ago
Just like last time, I used the desktop (the web terminal doesn't even connect, and I'd like to understand how to connect it). As for that error, I know about it and have in fact fixed it, but I'm back to the same problem as last time: the “crash”.
Pod: 2x RTX 3090 (48 GB VRAM) • 250 GB RAM • 64 vCPU • Total Disk: 180 GB
Unknown User
Unknown User3mo ago
Message Not Public
Murex90
Murex90OP3mo ago
Kohya, SDXL-Basic parameters, I only modified epochs and steps since this is just a test. @Jason
Unknown User
Unknown User3mo ago
Message Not Public
Murex90
Murex90OP3mo ago
Ah sorry, I didn't realize you were asking so that you could try it out. These are the parameters I used: Model: SDXL base 1.0, presets: sdxl-1 image LoRA v1.0, basic: standard, batch: 1, epoch: 1, max train: 1, steps: 3000, save every n epochs: 1, stop TE: 80%, samples every n steps: 500, sample every n epochs: 1. Try it out when you have time; the remaining credits wouldn't be enough for a full training session anyway, so I'd need to top up first.
Unknown User
Unknown User3mo ago
Message Not Public
Murex90
Murex90OP3mo ago
Do you mean to leave all the default parameters and only set “network_train” in the 'Advanced' section?
Unknown User
Unknown User3mo ago
Message Not Public
Murex90
Murex90OP3mo ago
I'll give it a try to see, even though the basic parameters aren't good, for example, the base 'steps' are only 160 (a bit too few). I ran the test, but nothing changes, it still crashes at the same point.
Unknown User
Unknown User3mo ago
Message Not Public
Murex90
Murex90OP3mo ago
As I was saying last time, the web terminal doesn't work, or rather, the link doesn't. I open the terminal, launch the Kohya setup, it loads all the dependencies, then generates the link. I copy and paste the link into the browser, and it says “unable to connect”.
Unknown User
Unknown User3mo ago
Message Not Public
Murex90
Murex90OP3mo ago
Obviously I clicked on 'connect', otherwise I wouldn't even be able to open the web terminal. What doesn't work is the link. I click connect – web terminal – run the Kohya command – it installs all the dependencies – at the end it gives me the link (just like on desktop), but on desktop the link opens Kohya without any issues, whereas when I copy the link from the web terminal, it says 'unable to connect' (Kohya doesn't even launch). Where am I supposed to put 'argument' if I can't even see Kohya?
Unknown User
Unknown User3mo ago
Message Not Public
Murex90
Murex90OP3mo ago
Thank you for taking the time to try it yourself and for the template. I tried to use it, but when I click on 'Start', it loads for a few seconds and then stops, it doesn't connect.
Unknown User
Unknown User3mo ago
Message Not Public
Murex90
Murex90OP3mo ago
I tried again and now the direct link appears (the web terminal at the bottom still does the same 'start and stop' as yesterday), it says 'not ready' but Kohya does open. However, there’s an issue with the dataset, obviously, when I enter the folder path it can’t find it because it’s on my computer, and since there’s no desktop (like in the previous 'Kasm' template), I can’t download the dataset. How am I supposed to use the dataset with this new template?
Unknown User
Unknown User3mo ago
Message Not Public
Murex90
Murex90OP3mo ago
How can I do this?
Unknown User
Unknown User3mo ago
Message Not Public
Murex90
Murex90OP3mo ago
I meant the dataset: how can I upload it to the pod?
Unknown User
Unknown User3mo ago
Message Not Public
Murex90
Murex90OP3mo ago
Thank you for the work you've done. I tried again, and now both the launcher and Jupyter are there (both still say 'not ready' even after 20 minutes). I was able to upload the dataset, but Kohya doesn't detect it, and whatever I try, it says “connection error out”.
Unknown User
Unknown User3mo ago
Message Not Public
Murex90
Murex90OP3mo ago
I tried different regions/servers, with one, it gave the same “connection error out” as before; with another, it only had Jupyter and the web terminal (no launcher). I opened the web terminal, installed all the dependencies, it generated the link, I clicked the link and it says “unable to connect”.
Unknown User
Unknown User3mo ago
Message Not Public
Murex90
Murex90OP3mo ago
Yes, I tried both.
