Hello everyone. I am Dr. Furkan Gözükara, PhD Computer Engineer. SECourses is a dedicated YouTube channel for the following topics: Tech, AI, News, Science, Robotics, Singularity, ComfyUI, SwarmUI, ML, Artificial Intelligence, Humanoid Robots, Wan 2.2, FLUX, Krea, Qwen Image, VLMs, Stable Diffusion
I've written Python scripts with ChatGPT that process high-resolution images for training by cutting them into 1024x1024 tiles with a bit of overlap.
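Not the original script, but a minimal sketch of that tiling idea, assuming Pillow; the 128 px overlap and the folder names are illustrative choices, not values from the post:

```python
# Cut high-resolution images into overlapping 1024x1024 tiles for training.
from pathlib import Path
from PIL import Image

TILE = 1024      # tile width/height in pixels
OVERLAP = 128    # overlap between adjacent tiles (assumed value)
STRIDE = TILE - OVERLAP

def _positions(full: int) -> list[int]:
    """Start offsets that cover one dimension, including a final edge-aligned tile."""
    if full <= TILE:
        return [0]
    pos = list(range(0, full - TILE + 1, STRIDE))
    if pos[-1] != full - TILE:
        pos.append(full - TILE)
    return pos

def tile_image(src: Path, dst_dir: Path) -> None:
    """Save overlapping TILE x TILE crops of one image into dst_dir."""
    dst_dir.mkdir(parents=True, exist_ok=True)
    img = Image.open(src).convert("RGB")
    w, h = img.size
    for top in _positions(h):
        for left in _positions(w):
            crop = img.crop((left, top, left + TILE, top + TILE))
            crop.save(dst_dir / f"{src.stem}_{top}_{left}.png")

if __name__ == "__main__":
    for path in Path("source_images").glob("*.jpg"):  # assumed input folder
        tile_image(path, Path("tiles"))                # assumed output folder
```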
This fine-tuned checkpoint is based on Flux dev de-distilled, so it requires a special ComfyUI workflow and won't work very well ...
Not really. The higher the resolution, the longer it takes to render, obviously, and you get better results but also diminishing returns eventually. This one was resized to 1080p, I believe, and took 5 min to render.
I sent the workflow above with the input video. Play with the prompt and settings, and make sure the skip steps and the normal scheduler steps together equal 20 generation steps; anything under that breaks the generation.
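If I'm reading that step rule right (my interpretation, and the 5/15 split below is only an example, not a recommended setting), the skipped steps and the remaining scheduler steps should sum to 20:

```python
# Illustrative check only: skip steps + scheduler steps must total 20.
total_steps = 20
skip_steps = 5
scheduler_steps = total_steps - skip_steps  # 15

assert skip_steps + scheduler_steps == total_steps  # anything under 20 total breaks the generation
```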
I already tested some top Flux dev de-distilled models and the quality was quite disappointing: most of them are only good for realistic images, and for style they are the same as or even worse than Flux dev, although render time is doubled due to CFG > 1. Are there any de-distilled models you guys can recommend that can maybe also be used for LoRA training and fine-tuning?
What would you guys recommend for Flux checkpoint training in Kohya for a character model? I have 40 images and a 3090 Ti (24 GB VRAM, 64 GB RAM). More than batch size 1? Is current best practice not to caption our training sets and just to use "ohwx woman"? Things change so fast in this world.
@Dr. Furkan Gözükara I'm currently using Flux Q8 on Kaggle; is there something better I can use for multi-purpose work (realism, style, etc.)? I saw something in chat about de-distilled versions?
Does anyone know the easiest way to fine-tune your own VLM for image captioning? I've already got my own dataset, but there doesn't seem to be a straightforward way to actually carry out the fine-tuning process itself... I know there are already some good captioners out there, but I want to fine-tune my own.
@Dr. Furkan Gözükara I've noticed that SwarmUI downloads the t5xxl_enconly.safetensors file every time you try to use Flux, even if you set the t5xxl model location manually. Furthermore, the debug logs mention that the Comfy back-end is typecasting the fp8_e4m3fn model into bf16 every time a Flux-type checkpoint loads from scratch, and that typecasting takes significant CPU time. I looked in the SwarmUI code and it seems to be downloading the model from this repo: https://huggingface.co/mcmonkey/google_t5-v1_1-xxl_encoderonly/ That model is 4.9 GB, which is consistent with SwarmUI's VRAM usage when running Flux inference. My concern is that all this time we have been using an inferior fp8-cast t5xxl model for generations, which according to many reports decreases output quality much more than casting Flux itself to fp8.
I noticed this while researching why swapping fp16 Flux models takes so much time. If my findings are correct, about half of that time is wasted on needless typecasting of t5xxl from fp8 to bf16.
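One way to check what the downloaded encoder actually stores (a sketch, not part of the original discussion; the file path is an assumption, and it only inspects stored dtypes with the safetensors library):

```python
# Inspect the dtypes inside the downloaded T5-XXL encoder file.
# The path is an assumption; point it at wherever SwarmUI saved t5xxl_enconly.safetensors.
from collections import Counter
from safetensors import safe_open

path = "Models/clip/t5xxl_enconly.safetensors"  # assumed location

dtypes = Counter()
total_bytes = 0
with safe_open(path, framework="pt", device="cpu") as f:
    for key in f.keys():
        t = f.get_tensor(key)
        dtypes[str(t.dtype)] += 1
        total_bytes += t.numel() * t.element_size()

print(dtypes)                         # a float8_e4m3fn majority would confirm an fp8 file
print(f"{total_bytes / 1e9:.1f} GB")  # roughly 4.9 GB for fp8 vs ~9.6 GB for fp16/bf16
```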
Notice your VRAM usage on 40+ GB cards while running Flux fp16 inference. It should use at least 33.6 GB of VRAM to fit both Flux and t5xxl + CLIP, but it seems to be using around 30 GB, which is more consistent with the downcast-t5xxl theory.
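Rough back-of-the-envelope arithmetic behind those numbers (a sketch; the parameter counts are approximate assumptions, not measurements, and activation/overhead memory is ignored):

```python
# Weight memory only: parameter count x bytes per weight.
GB = 1e9

flux_params = 12e9     # Flux dev is roughly a 12B-parameter transformer
t5xxl_params = 4.8e9   # T5-XXL encoder-only, roughly 4.8B parameters (assumed)
clip_l_params = 0.12e9 # CLIP-L text encoder, small by comparison

fp16 = 2  # bytes per weight
fp8 = 1

all_fp16 = (flux_params + t5xxl_params + clip_l_params) * fp16 / GB
t5_in_fp8 = (flux_params * fp16 + t5xxl_params * fp8 + clip_l_params * fp16) / GB

print(f"everything fp16: ~{all_fp16:.1f} GB")   # ~33.8 GB, close to the 33.6 GB figure above
print(f"t5xxl in fp8:    ~{t5_in_fp8:.1f} GB")  # ~29.0 GB, close to the ~30 GB observed
```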