Hello everyone. I am Dr. Furkan Gözükara, PhD Computer Engineer. SECourses is a dedicated YouTube channel for the following topics: Tech, AI, News, Science, Robotics, Singularity, ComfyUI, SwarmUI, ML, Artificial Intelligence, Humanoid Robots, Wan 2.2, FLUX, Krea, Qwen Image, VLMs, Stable Diffusion
Good morning, gentlemen. I've been using Tier1_48_GB_Faster_v2.json for DreamBooth on RunPod and the training constantly terminates during the text encoder training part with:

subprocess.CalledProcessError: Command '['/opt/environments/python/kohya/bin/python3.10', '/workspace/kohya_ss/sd-scripts/sdxl_train.py', '--config_file', '/workspace/kohya_ss/outputs/config_dreambooth-20250719-133856.toml', '--max_grad_norm=0.0', '--no_half_vae', '--train_text_encoder', '--learning_rate_te2=0']' returned non-zero exit status 2.
As I understand it, Kohya wants a non-zero parameter for the second text encoder. Should I insert the same value as for TE1, 0.000003?
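For reference, a minimal sketch of patching the generated TOML so TE2 gets the same learning rate as TE1, assuming the 'toml' package is installed (pip install toml) and that the config uses Kohya's learning_rate_te1 / learning_rate_te2 keys (both are real sdxl_train.py options; the 3e-06 fallback is illustrative, not a recommendation):

import toml

cfg_path = "/workspace/kohya_ss/outputs/config_dreambooth-20250719-133856.toml"
cfg = toml.load(cfg_path)
cfg["learning_rate_te2"] = cfg.get("learning_rate_te1", 3e-06)  # mirror TE1's LR
with open(cfg_path, "w") as f:
    toml.dump(cfg, f)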
Hey guys. I'm currently training a FLUX style finetune using Kohya with a dataset of around 1800 images in total. When leaving max epochs at 0 (the default), it was automatically set to 6. This seemed strange to me; I've never used a dataset this large. I'm using the "Batch_Size_7_48GB_GPU_46250MB_29.1_second_it_Tier_1" preset as provided by Dr. Furkan. Should I tweak another variable, like the learning rate, for a dataset this big? I set max epochs to 10 and it says it will train for 20 hours on an A6000 GPU. I thought I'd ask before letting it train for 20 hours only for the result to be trash.
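As a sanity check on that estimate, here is some back-of-envelope step math, assuming 1 repeat per image and taking the batch size (7) and speed (~29.1 s/it) from the preset's name:

import math

images, batch_size, sec_per_it = 1800, 7, 29.1
steps_per_epoch = math.ceil(images / batch_size)   # ~258 steps per epoch
for epochs in (6, 10):
    total_steps = steps_per_epoch * epochs
    print(f"{epochs} epochs -> {total_steps} steps, "
          f"~{total_steps * sec_per_it / 3600:.1f} h")
# 10 epochs -> 2580 steps -> ~20.9 h, which matches the GUI's 20-hour estimate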
Thanks for the response! They are high-quality screenshots from a show with a unique art style, consistent in style only. All of them are uniquely captioned in the same format to describe the image, with a trigger word for the style at the start of the caption. I'll try what you suggested. One more question: can I train 10 epochs and then, using the resulting model as a base, continue training for another 5 epochs in a separate run to get the same or a similar result as training 15 epochs from the start? A checkpoint per epoch would require me to regularly offload the models to another server due to disk space, risking messing this up and epochs not saving.
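On splitting the run: resuming from just the saved .safetensors restarts the optimizer state and LR schedule, so 10 + 5 epochs is only approximately equal to 15 straight epochs. Kohya's sd-scripts can carry the full state across runs via --save_state and --resume (both real flags); the sketch below assumes flux_train.py with illustrative paths, and the resume semantics should be verified against your sd-scripts version:

import subprocess

# First run: also save the full training state (optimizer, scheduler,
# step count) alongside the model checkpoints.
subprocess.run([
    "python", "/workspace/kohya_ss/sd-scripts/flux_train.py",
    "--config_file", "my_config.toml",
    "--max_train_epochs", "10",
    "--save_state",
], check=True)

# Second run: resume from the saved state directory and raise the target,
# so training continues from epoch 10 up to 15.
subprocess.run([
    "python", "/workspace/kohya_ss/sd-scripts/flux_train.py",
    "--config_file", "my_config.toml",
    "--max_train_epochs", "15",
    "--resume", "outputs/my_model-state",
], check=True)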
Also, though maybe not particularly relevant: the dataset is actually two sets of the same 900 images in different aspect ratios. According to my own testing and other testing I've seen, training on different aspect ratios improves results, particularly with style finetunes. I've never tried it with this many images, though.
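For what it's worth, a sketch of how a second aspect-ratio variant of the same images can be produced (center-cropping to square here; the directory names and the .png glob are assumptions):

from pathlib import Path
from PIL import Image

src, dst = Path("dataset/original"), Path("dataset/square")
dst.mkdir(parents=True, exist_ok=True)
for p in src.glob("*.png"):
    im = Image.open(p)
    side = min(im.size)  # the shorter edge becomes the square's side
    left, top = (im.width - side) // 2, (im.height - side) // 2
    im.crop((left, top, left + side, top + side)).save(dst / p.name)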
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 79.20 GiB of which 16.62 MiB is free. Process 2510593 has 79.18 GiB memory in use.
Training was successful at batch size 2 on an H100 SXM. This thing is demanding indeed; it consumed around 77 GB of VRAM.
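For anyone hitting the same OOM, a sketch of Kohya flags commonly used to trim VRAM when a full finetune overflows an 80 GB card; the flag names exist in sd-scripts, but which ones apply (and how much they save) depends on the script and branch you run:

low_vram_args = [
    "--gradient_checkpointing",       # recompute activations instead of storing them
    "--full_bf16",                    # keep weights and gradients in bf16
    "--optimizer_type", "Adafactor",  # lighter optimizer state than AdamW
    "--train_batch_size", "2",        # the batch size that fit on the H100 SXM
]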
I was wondering if you have Python code that I can refer to or modify for finetuning instead of using the Kohya GUI? I want to load the parameters and train without having to open the interface and load the parameters every time.
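A minimal headless sketch: the Kohya GUI ultimately shells out to sd-scripts (as the error trace above shows), so a saved --config_file can be launched directly. The paths mirror the RunPod layout quoted earlier, and the accelerate options are assumptions to adapt to your machine:

import subprocess

cmd = [
    "accelerate", "launch", "--num_cpu_threads_per_process", "2",
    "/workspace/kohya_ss/sd-scripts/sdxl_train.py",
    "--config_file", "/workspace/kohya_ss/outputs/my_saved_config.toml",
]
subprocess.run(cmd, check=True)  # raises CalledProcessError on failure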