Hello everyone. I am Dr. Furkan Gözükara, PhD Computer Engineer. SECourses is a dedicated YouTube channel for the following topics: Tech, AI, News, Science, Robotics, Singularity, ComfyUI, SwarmUI, ML, Artificial Intelligence, Humanoid Robots, Wan 2.2, FLUX, Krea, Qwen Image, VLMs, Stable Diffusion
I also have another question. The answer might be somewhere, but I haven't come across it yet. In your video you said one can have as many concepts as they'd like. 1. Can I have a concept for the face, like in your video with the masks, and have another concept for the whole body, face included? 2. The LoRA is about 6 GB, so is it loaded as a checkpoint or using a LoRA loader?
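For context on question 2, what I mean by a LoRA loader is something like the diffusers-style load sketched below. The model ID and file path are placeholders, and the SDXL pipeline is just an assumption for illustration:

```python
# Sketch of what I mean by "using a LoRA loader" - a diffusers-style load.
# The model ID and the LoRA path are placeholders, not a tested recipe.
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

# My understanding: even a large (~6 GB) LoRA file still goes through the
# LoRA loader rather than being loaded as a full checkpoint.
pipe.load_lora_weights("path/to/my_6gb_lora.safetensors")
```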
Can someone take a look at my dataset images and tell me if they're good or not? I'm training for SDXL and going mad over why my training isn't even close to okay. I'm using OneTrainer, but none of my trainings have been good.
I'm preparing to fine-tune the F5-TTS model for Polish, since there isn't one available yet. I did find one person who created a Polish model using about 90 hours of recordings and trained it on an A100 80GB for around 24 hours. Unfortunately, he didn't share that model. https://www.youtube.com/watch?v=K6vY9Je4ufQ
That’s why I decided to give it a try myself. There isn’t much information online about TTS training configurations, unlike with photo or video models. Based on what I managed to gather so far:
My dataset contains 142 hours of correct Polish speech. The dataset has been split into smaller files with transcripts (the transcription process is still ongoing).
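For the splitting step, the sketch below shows roughly how I'm pairing each clip with its transcript into a single metadata file. The pipe-separated audio_file|transcript layout and the folder names are my assumptions, so check them against whatever your F5-TTS version actually expects:

```python
# Sketch of assembling transcript metadata after splitting the recordings.
# Assumes each clip clips/xxx.wav has a matching clips/xxx.txt transcript,
# and that a pipe-separated metadata.csv is what the trainer expects
# (this layout is my assumption - check your F5-TTS version's dataset docs).
from pathlib import Path

clips_dir = Path("clips")  # hypothetical folder of split wav files
rows = []
for wav in sorted(clips_dir.glob("*.wav")):
    txt = wav.with_suffix(".txt")
    if not txt.exists():
        continue  # transcription is still ongoing for this clip
    transcript = txt.read_text(encoding="utf-8").strip()
    if transcript:
        rows.append(f"{wav.name}|{transcript}")

Path("metadata.csv").write_text("\n".join(rows) + "\n", encoding="utf-8")
print(f"wrote {len(rows)} entries")
```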
As for the configuration, I’m not entirely sure if it’s correct, but I plan to start training with the following settings:
I don’t know if it will work, and I also don’t know how long it will take on an RTX 4090. Possibly a few days! XD
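To get a feel for the scale, here is a rough back-of-envelope estimate of the optimizer steps in one epoch over my dataset. The sample rate, hop length, and frame batch size are my assumptions about the F5-TTS defaults, not verified numbers:

```python
# Rough back-of-envelope estimate of optimizer steps per epoch.
# Assumptions (my guesses, not verified): mels at 24 kHz with a hop
# length of 256 samples (~93.75 frames/s), and a frame-based batch
# size of 3200 frames per GPU.

dataset_hours = 142
sample_rate = 24_000   # Hz, assumed
hop_length = 256       # samples per mel frame, assumed
batch_frames = 3200    # frames per batch per GPU, assumed

frames_per_second = sample_rate / hop_length        # ~93.75
total_frames = dataset_hours * 3600 * frames_per_second
steps_per_epoch = total_frames / batch_frames

print(f"{total_frames:,.0f} mel frames -> ~{steps_per_epoch:,.0f} steps/epoch")
# ~47,925,000 mel frames -> ~14,977 steps/epoch
```

At roughly 15k steps per epoch, even a few epochs add up fast on a single GPU, which is why I expect days rather than hours.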
So, if anyone here has done a similar training and could help me out with tips or suggestions, I’d really appreciate it.
Yesterday, I ran a very short test training with just a 2-hour dataset. Unfortunately, the process crashed during the night, but it managed to reach 2500 steps. I saved sample outputs every 500 steps, so I have five of them. I must say, at 500 steps the difference between the reference wav and the generated file was huge – as a native Polish speaker, I couldn’t understand a single word from the generated one. But at 2500 steps, it was already intelligible. Lots of mistakes, but at least I could understand the speech.
I could share the 2500-step sample here, but since it’s in Polish, I’m not sure if any of you would understand it.
Anyway, if someone can help, I’d be very grateful for any advice.
I was never able to reproduce amazing results using the FLUX Kontext workflow. Everything it generates looks like FLUX Schnell - old-generation / Midjourney / SDXL quality.