Anyone know the easiest way to finetune your own VLLM for image captioning?

Anyone know the easiest way to finetune your own VLLM for image captioning? I've already got my own dataset, but there doesn't seem to be a straightforward way to actually run the finetuning itself. I know there are already some good captioners out there, but I want to finetune my own.
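Not an official recipe, but here's roughly how I'd sketch it with Hugging Face `transformers` and a BLIP captioning checkpoint. Everything here is an assumption on my part: the model (`Salesforce/blip-image-captioning-base`), the `(image_path, caption)` data format, and the hyperparameters — swap in whatever matches your setup.

```python
# Hedged sketch: fine-tune a BLIP captioner on (image_path, caption) pairs.
# Model name, data format, and hyperparameters are assumptions, not a
# definitive recipe.
from dataclasses import dataclass


@dataclass
class CaptionExample:
    image_path: str
    caption: str


class CaptionDataset:
    """Map-style dataset; works with a torch DataLoader via __len__/__getitem__."""

    def __init__(self, examples, processor, max_length=64):
        self.examples = examples
        self.processor = processor
        self.max_length = max_length

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, idx):
        from PIL import Image

        ex = self.examples[idx]
        image = Image.open(ex.image_path).convert("RGB")
        # The processor tokenizes the caption and preprocesses the image together
        enc = self.processor(
            images=image,
            text=ex.caption,
            padding="max_length",
            truncation=True,
            max_length=self.max_length,
            return_tensors="pt",
        )
        return {k: v.squeeze(0) for k, v in enc.items()}


def train(model, dataset, epochs=3, lr=5e-5, batch_size=8, device="cuda"):
    import torch
    from torch.utils.data import DataLoader

    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
    optim = torch.optim.AdamW(model.parameters(), lr=lr)
    model.to(device).train()
    for _ in range(epochs):
        for batch in loader:
            batch = {k: v.to(device) for k, v in batch.items()}
            # BLIP returns a language-modeling loss when labels are passed
            out = model(
                pixel_values=batch["pixel_values"],
                input_ids=batch["input_ids"],
                attention_mask=batch["attention_mask"],
                labels=batch["input_ids"],
            )
            out.loss.backward()
            optim.step()
            optim.zero_grad()


# Usage (downloads weights, so left as comments):
# from transformers import BlipProcessor, BlipForConditionalGeneration
# name = "Salesforce/blip-image-captioning-base"
# processor = BlipProcessor.from_pretrained(name)
# model = BlipForConditionalGeneration.from_pretrained(name)
# pairs = [CaptionExample("images/dog.jpg", "a dog running on the beach")]
# train(model, CaptionDataset(pairs, processor))
```

If you'd rather not hand-roll the loop, the same dataset plugs into `transformers.Trainer`, and PEFT/LoRA helps a lot if the full model doesn't fit in memory.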