Hello everyone. I am Dr. Furkan Gözükara, PhD Computer Engineer. SECourses is a dedicated YouTube channel for the following topics: Tech, AI, News, Science, Robotics, Singularity, ComfyUI, SwarmUI, ML, Artificial Intelligence, Humanoid Robots, Wan 2.2, FLUX, Krea, Qwen Image, VLMs, Stable Diffusion
Got an error at the start of training; I have not gotten this error before. I have done 20 trainings with no problems starting:

"2024-12-09 12:23:35 INFO Loading state dict from /home/Ubuntu/Downloads/t5xxl_fp16.safetensors    flux_utils.py:330
Traceback (most recent call last):
  File "/home/Ubuntu/apps/kohya_ss/sd-scripts/library/utils.py", line 366, in load_safetensors
    state_dict = load_file(path, device=device)
  File "/home/Ubuntu/apps/kohya_ss/venv/lib/python3.10/site-packages/safetensors/torch.py", line 313, in load_file
    with safe_open(filename, framework="pt", device=device) as f:
safetensors_rust.SafetensorError: Error while deserializing header: MetadataIncompleteBuffer
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/home/Ubuntu/apps/kohya_ss/sd-scripts/flux_train.py", line 849, in <module>
    train(args)
  File "/home/Ubuntu/apps/kohya_ss/sd-scripts/flux_train.py", line 222, in train
    t5xxl = flux_utils.load_t5xxl(args.t5xxl, weight_dtype, "cpu", args.disable_mmap_load_safetensors)
  File "/home/Ubuntu/apps/kohya_ss/sd-scripts/library/flux_utils.py", line 331, in load_t5xxl
    sd = load_safetensors(ckpt_path, device=str(device), disable_mmap=disable_mmap, dtype=dtype)
  File "/home/Ubuntu/apps/kohya_ss/sd-scripts/library/utils.py", line 368, in load_safetensors
    state_dict = load_file(path)  # prevent device invalid Error
  File "/home/Ubuntu/apps/kohya_ss/venv/lib/python3.10/site-packages/safetensors/torch.py", line 313, in load_file
"
ChatGPT o1-preview gave this answer: "Yes, you can run the two GPUs separately, treating each as its own device. Even though your RTX A6000 GPUs are connected via NVLink, which enables high-bandwidth peer-to-peer communication and memory access between them, they still present themselves as two distinct GPU devices to the system and to frameworks like PyTorch or TensorFlow."