I stopped using the kohya GUI because I couldn't see how to enable multi-GPU, so I've just been using the command line. This is the command:

accelerate launch --num_cpu_threads_per_process=16 --num_processes=6 --multi_gpu --num_machines=1 --gpu_ids=0,1,2,3,4,5 "./sdxl_train.py" --pretrained_model_name_or_path="stabilityai/stable-diffusion-xl-base-1.0" --train_data_dir="training/img" --reg_data_dir="training/reg" --resolution="1024,1024" --output_dir="training/model" --logging_dir="training/log" --save_model_as=safetensors --full_bf16 --vae="stabilityai/sdxl-vae" --output_name="TESTsuperxl" --lr_scheduler_num_cycles="8" --max_data_loader_n_workers="0" --learning_rate_te1="3e-06" --learning_rate_te2="0.0" --learning_rate="1e-05" --lr_scheduler="constant" --train_batch_size="1" --max_train_steps="9600" --save_every_n_epochs="1" --mixed_precision="bf16" --save_precision="bf16" --cache_latents --cache_latents_to_disk --optimizer_type="Adafactor" --optimizer_args scale_parameter=False relative_step=False warmup_init=False weight_decay=0.01 --max_data_loader_n_workers="0" --bucket_reso_steps=64 --gradient_checkpointing --bucket_no_upscale --noise_offset=0.0 --max_grad_norm=0.0 --no_half_vae --train_text_encoder


I've tried this with batch sizes of 1 and 2. A batch size of 2 is actually slower than 1, and anything higher than 2 gives me an out-of-memory error.
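
When comparing batch sizes, it can help to look at samples per second rather than seconds per step, since each step at batch size 2 processes twice as many images. A small sketch, assuming the effective batch is train_batch_size times num_processes; the function name and the timings are made-up placeholders, not measurements from this run:

```python
def samples_per_second(batch_size, num_processes, seconds_per_step):
    """Images processed per second across all GPUs (hypothetical helper)."""
    # Each optimization step consumes batch_size images on every process.
    return batch_size * num_processes / seconds_per_step

# Placeholder timings only: substitute the s/it values from the progress bar.
bs1 = samples_per_second(batch_size=1, num_processes=6, seconds_per_step=1.5)
bs2 = samples_per_second(batch_size=2, num_processes=6, seconds_per_step=2.5)
print(bs1, bs2)
```

If the batch size 2 figure comes out lower than the batch size 1 figure, the larger batch really is slower per image and not just per step.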

I first tried 40 repeats on the original images, and have now dropped that to 7 (close enough to 40 / 6).

Changing that didn't change the total number of optimization steps, which was always 9600, or the time to complete training. I think it just wanted to do more epochs.

After changing the repeats to 7 and removing the flag
--max_train_steps="9600"
it's now trying to do 46 epochs for a total of 1600 optimization steps. It still says it's going to take over 4 hours.
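
The 46-epoch figure is consistent with the epoch count being derived from a step cap. As far as I can tell, 1600 is the sd-scripts default for --max_train_steps when the flag is omitted. A minimal sketch of that arithmetic, assuming each epoch's samples are split evenly across processes and rounded up; the function names and the 210-sample figure are hypothetical, chosen only so the numbers line up with the output above:

```python
import math

def steps_per_epoch(num_samples, batch_size, num_processes, grad_accum_steps=1):
    """Optimization steps in one epoch (hypothetical helper)."""
    # Each process sees roughly 1/num_processes of the batches.
    per_process_batches = math.ceil(num_samples / (batch_size * num_processes))
    return math.ceil(per_process_batches / grad_accum_steps)

def num_epochs(max_train_steps, steps_each_epoch):
    """Epochs needed to reach the step cap, rounding up."""
    return math.ceil(max_train_steps / steps_each_epoch)

# 210 is an assumed count of images x repeats per epoch, not a known value.
spe = steps_per_epoch(num_samples=210, batch_size=1, num_processes=6)
print(spe, num_epochs(1600, spe))  # 35 46
print(num_epochs(9600, spe))       # 275
```

With the step cap fixed, changing the repeats only changes how many steps fit into one epoch, so the epoch count moves while the total steps and the estimated time stay the same.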

I tried using the
--ddp_gradient_as_bucket_view
flag as specified in the updated sd-scripts repo, but that made it 5-6x slower.