Runpodβ€’15mo ago
Untrack4d

Training Flux Schnell on serverless

Hi there, I am using your pods to run ostris/ai-toolkit to train Flux on custom images. The thing is, now I want to use your serverless endpoint capabilities. Can you help me out? Do you have some kind of template or guide on how to do it?
93 Replies
navin_hariharan
navin_hariharanβ€’14mo ago
@Untrack4d Hi! I have the dev serverless working already! I'll add schnell soon.
Untrack4d
Untrack4dOPβ€’14mo ago
Do you have some demo or can I test it out?
navin_hariharan
navin_hariharanβ€’14mo ago
Give me 30min
Untrack4d
Untrack4dOPβ€’14mo ago
Ok man, thanks. What are you using to train it?
navin_hariharan
navin_hariharanβ€’14mo ago
{ "input": { "lora_file_name": "laksheya-geraldine_viswanathan-FLUX", "trigger_word": "geraldine viswanathan", "gender":"woman", "data_url": "dataset_zip url" }, "s3Config": { "accessId": "accessId", "accessSecret": "accessSecret", "bucketName": "flux-lora", "endpointUrl": "https://minio-api.cloud.com" } } @Untrack4d
Untrack4d
Untrack4dOPβ€’14mo ago
Thanks for sharing, I will check it out. What does this image contain: FROM navinhariharan/flux-lora:latest? How are you handling the long-running training process?
navin_hariharan
navin_hariharanβ€’14mo ago
Disable this for long-running processes.
(screenshot attached)
navin_hariharan
navin_hariharanβ€’14mo ago
FROM navinhariharan/flux-lora:latest: this image contains the Flux models, dev and schnell.
Untrack4d
Untrack4dOPβ€’14mo ago
Thank you for the help 🫑
navin_hariharan
navin_hariharanβ€’14mo ago
Anytime 🙂 So the LoRA is trained and sent to your S3 bucket!
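Under the hood, the upload step usually amounts to a few lines of boto3 against whatever S3-compatible store the s3Config points at (MinIO included). A hedged sketch, assuming the s3Config fields from the example request rather than navin's actual handler code:

    # Hedged sketch of the upload step, assuming boto3 and the s3Config fields
    # from the example request; the real handler may differ.
    import boto3

    def upload_lora(local_path, s3_config, key):
        s3 = boto3.client(
            "s3",
            endpoint_url=s3_config["endpointUrl"],  # MinIO or any S3-compatible store
            aws_access_key_id=s3_config["accessId"],
            aws_secret_access_key=s3_config["accessSecret"],
        )
        s3.upload_file(local_path, s3_config["bucketName"], key)
        # A presigned URL lets you fetch the LoRA later without exposing credentials.
        return s3.generate_presigned_url(
            "get_object",
            Params={"Bucket": s3_config["bucketName"], "Key": key},
            ExpiresIn=7 * 24 * 3600,
        )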
Untrack4d
Untrack4dOPβ€’14mo ago
I will be hosting it in a server of mine to reduce costs
navin_hariharan
navin_hariharanβ€’14mo ago
I use MinIO!
Untrack4d
Untrack4dOPβ€’14mo ago
Never heard of it.
navin_hariharan
navin_hariharanβ€’14mo ago
It's open-source, S3-compatible storage.
navin_hariharan
navin_hariharanβ€’14mo ago
MinIO
MinIO | S3 Compatible Storage for AI
MinIO's High Performance Object Storage is Open Source, Amazon S3 compatible, Kubernetes Native and is designed for cloud native workloads like AI.
Untrack4d
Untrack4dOPβ€’14mo ago
I will take a look
navin_hariharan
navin_hariharanβ€’14mo ago
Sure! If you have issues let me know! I'll be happy to help!
Untrack4d
Untrack4dOPβ€’14mo ago
Do you have any tips to get better results? Or to make it train faster?
navin_hariharan
navin_hariharanβ€’14mo ago
A sample dataset with the default params works!
navin_hariharan
navin_hariharanβ€’14mo ago
It takes 2 hours! The one in the Civitai LoRA trainer is faster!
Untrack4d
Untrack4dOPβ€’14mo ago
I was using ai-toolkit. What hardware are you using?
navin_hariharan
navin_hariharanβ€’14mo ago
(screenshot attached)
Untrack4d
Untrack4dOPβ€’14mo ago
Does it work for schnell? Is it faster than ai-toolkit?
navin_hariharan
navin_hariharanβ€’14mo ago
You can deploy this to get started!
(screenshot attached)
navin_hariharan
navin_hariharanβ€’14mo ago
Yes! Yes! The LoRA size is small too, with no loss of quality! navinhariharan/flux-lora:latest
Untrack4d
Untrack4dOPβ€’14mo ago
With ai-toolkit I am getting about 30-40 min for 1000 steps.
navin_hariharan
navin_hariharanβ€’14mo ago
I do 2000 steps!
Untrack4d
Untrack4dOPβ€’14mo ago
Ok, that makes sense. Are you doing some kind of image selection/preprocessing?
navin_hariharan
navin_hariharanβ€’14mo ago
Yep! The captions!
Untrack4d
Untrack4dOPβ€’14mo ago
I am using Florence-2 for that. You aren't excluding low-quality ones, resizing, etc.?
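For reference, Florence-2 captioning with transformers typically looks something like the sketch below; the model ID, task prompt, and generation settings are common defaults, not necessarily the exact pipeline used in this thread:

    # Hedged sketch of Florence-2 captioning (assumes transformers with trust_remote_code);
    # model ID and task prompt are the commonly used defaults, not the thread's exact script.
    import torch
    from PIL import Image
    from transformers import AutoModelForCausalLM, AutoProcessor

    model_id = "microsoft/Florence-2-large"
    processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.float16, trust_remote_code=True
    ).to("cuda")

    def caption(image_path: str, task: str = "<DETAILED_CAPTION>") -> str:
        image = Image.open(image_path).convert("RGB")
        inputs = processor(text=task, images=image, return_tensors="pt").to("cuda", torch.float16)
        ids = model.generate(
            input_ids=inputs["input_ids"],
            pixel_values=inputs["pixel_values"],
            max_new_tokens=256,
        )
        # Decode and let the processor strip the task token from the output.
        text = processor.batch_decode(ids, skip_special_tokens=False)[0]
        parsed = processor.post_process_generation(
            text, task=task, image_size=(image.width, image.height)
        )
        return parsed[task]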
navin_hariharan
navin_hariharanβ€’14mo ago
The images you mean? I mix a bit of everything!
Untrack4d
Untrack4dOPβ€’14mo ago
I have noticed that low-quality ones can completely mess up your output. What have you put in this image, navinhariharan/flux-lora:latest? I want to customize it, can you share the source?
navin_hariharan
navin_hariharanβ€’14mo ago
black-forest-labs/FLUX.1-schnell and black-forest-labs/FLUX.1-dev. These are auto-downloaded by ai-toolkit! Instead of exporting an env var for HF_TOKEN, I downloaded them and baked them into a Docker image. They live here: /huggingface/
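One way to bake the weights in at build time is a small download script run while building the image. A hedged sketch, not necessarily how navinhariharan/flux-lora was built; the /huggingface/ paths follow the message above:

    # Hedged sketch of a build-time download script (run once during docker build,
    # with HF_TOKEN available only at build time). Paths are assumptions.
    from huggingface_hub import snapshot_download

    for repo in ("black-forest-labs/FLUX.1-dev", "black-forest-labs/FLUX.1-schnell"):
        snapshot_download(
            repo_id=repo,
            local_dir=f"/huggingface/{repo.split('/')[-1]}",
            # The token is picked up from the HF_TOKEN env var during the build;
            # the finished image then needs no token at runtime.
        )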
Untrack4d
Untrack4dOPβ€’14mo ago
I want to store those models in a network volume so they can be shared between serverless instances.
navin_hariharan
navin_hariharanβ€’14mo ago
That's the best!
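A hedged note on choosing where the models are stored: a network volume is typically mounted at /runpod-volume inside serverless workers, so pointing the Hugging Face cache there before importing anything that downloads models makes the weights land on the shared volume. A minimal sketch, with the cache path as an assumption:

    # Hedged sketch: point the Hugging Face cache at the network volume
    # (typically mounted at /runpod-volume in serverless workers) before
    # importing any library that downloads models.
    import os

    os.environ.setdefault("HF_HOME", "/runpod-volume/huggingface-cache")
    os.environ.setdefault("HUGGINGFACE_HUB_CACHE", "/runpod-volume/huggingface-cache/hub")

    # Imports that may trigger downloads must come after the env vars are set.
    from huggingface_hub import snapshot_download  # noqa: E402

    snapshot_download("black-forest-labs/FLUX.1-schnell")  # lands on the shared volume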
Untrack4d
Untrack4dOPβ€’14mo ago
The thing is I didn't understand how to choose where it's stored.

Another thing:

    def train_lora(job):
        if 's3Config' in job:
            s3_config = job["s3Config"]
        job_input = job["input"]
        job_input = download(job_input)
        if edityaml(job_input) == True:
            if job_input['gender'].lower() in ['woman', 'female', 'girl']:
                job = get_job('config/woman.yaml', None)
            elif job_input['gender'].lower() in ['man', 'male', 'boy']:
                job = get_job('config/man.yaml', None)
            job.run()

How are you able to run the job? Where does the get_job function come from?
navin_hariharan
navin_hariharanβ€’14mo ago
The handler bro!
Untrack4d
Untrack4dOPβ€’14mo ago
Yes but then you call job.run
navin_hariharan
navin_hariharanβ€’14mo ago
    runpod.serverless.start({"handler": train_lora})

This will call the function train_lora with the input JSON! That is:

    job = {
      "input": {
        "lora_file_name": "laksheya-geraldine_viswanathan-FLUX",
        "trigger_word": "geraldine viswanathan",
        "gender": "woman",
        "data_url": "dataset_zip url"
      },
      "s3Config": {
        "accessId": "accessId",
        "accessSecret": "accessSecret",
        "bucketName": "flux-lora",
        "endpointUrl": "https://minio-api.cloud.com"
      }
    }

@Untrack4d
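For anyone following along, a minimal skeleton of such a worker file looks roughly like this; download(), edityaml(), and get_job() stand in for the helpers in navin's handler (get_job appears to come from ai-toolkit's toolkit.job module), so treat it as a sketch rather than the exact file:

    # Hedged skeleton of the worker file; the numbered comments paraphrase the
    # steps in navin's handler rather than reproducing it.
    import runpod
    # from toolkit.job import get_job  # ai-toolkit's job factory (assumption)

    def train_lora(job):
        """RunPod calls this once per queued request, passing the request JSON as `job`."""
        job_input = job["input"]
        # 1. download and unpack the dataset from job_input["data_url"]
        # 2. edit the YAML config (trigger word, output name, dataset path)
        # 3. training_job = get_job("config/woman.yaml", None); training_job.run()
        # 4. upload the resulting .safetensors to the bucket in job["s3Config"]
        return {"status": "done", "lora_file_name": job_input["lora_file_name"]}

    # Registers the handler; the worker then pulls jobs from the endpoint queue.
    runpod.serverless.start({"handler": train_lora})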
Untrack4d
Untrack4dOPβ€’14mo ago
And where is that function? The train_lora?
navin_hariharan
navin_hariharanβ€’14mo ago
@Untrack4d Line 31
(screenshot attached)
Untrack4d
Untrack4dOPβ€’14mo ago
Sorry man, it was a pretty stupid question. That's what I get for trying to do n things at a time, ahaha.
navin_hariharan
navin_hariharanβ€’14mo ago
No issues mam! We are all learning 😄
Untrack4d
Untrack4dOPβ€’14mo ago
Have you managed to successfully use network volumes in serverless?
navin_hariharan
navin_hariharanβ€’14mo ago
I've never tried them! It shouldn't be difficult though
Sandeep
Sandeepβ€’14mo ago
Is this due to the container size? And may I know what the inference time is for an image to generate on an A100 or any other GPUs? For me it's taking 15 seconds. @navin_hariharan
navin_hariharan
navin_hariharanβ€’14mo ago
@Sandeep what is your input? Please remove any credentials you have and send it. It looks like an error while downloading the dataset.
Sandeep
Sandeepβ€’14mo ago
I am using Flux and SDXL models in this deployment. Whenever a user sends a Flux LoRA request, I generate with the Flux LoRA; the same applies to SDXL. The input is the LoRA blob URL and the model type. What should the container size be?
navin_hariharan
navin_hariharanβ€’14mo ago
That's all fine! How are you sending in the training dataset? @Sandeep
Sandeep
Sandeepβ€’14mo ago
This system doesn't need datasets, it just uses the models from Hugging Face: it imports the models from Hugging Face, downloads the LoRA, and uses that LoRA for inference.
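A hedged sketch of that flow with diffusers; the model ID, paths, and parameters are placeholders, not Sandeep's actual worker:

    # Hedged sketch of the inference flow described above (not the actual worker):
    # load the base model from Hugging Face, fetch a LoRA by URL, apply it, generate.
    import requests
    import torch
    from diffusers import FluxPipeline

    pipe = FluxPipeline.from_pretrained(
        "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
    ).to("cuda")

    def generate(prompt: str, lora_url: str):
        lora_path = "/tmp/lora.safetensors"
        with open(lora_path, "wb") as f:
            f.write(requests.get(lora_url, timeout=120).content)
        pipe.load_lora_weights(lora_path)   # apply the user's LoRA
        image = pipe(prompt, num_inference_steps=28, guidance_scale=3.5).images[0]
        pipe.unload_lora_weights()          # reset for the next request
        return image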
navin_hariharan
navin_hariharanβ€’14mo ago
Could you please send the worker files so that I can take a look? And also do not forget to remove sensitive info before sending!
Sandeep
Sandeepβ€’14mo ago
Getting this error when I am using runpod-volume:
(screenshot attached)
Sandeep
Sandeepβ€’14mo ago
    # Use a more specific base image for efficiency
    FROM runpod/base:0.6.2-cuda12.2.0

    # Set environment variables
    ENV HF_HUB_ENABLE_HF_TRANSFER=0 \
        PYTHONDONTWRITEBYTECODE=1 \
        PYTHONUNBUFFERED=1 \
        HF_HOME=/runpod-volume/huggingface-cache \
        HUGGINGFACE_HUB_CACHE=/runpod-volume/huggingface-cache/hub \
        WORKSPACE=/runpod-volume

    RUN ls -a /

    # Create necessary directories
    RUN mkdir -p ${WORKSPACE}/app ${HF_HOME}

    # Copy requirements first to leverage Docker cache for dependencies
    COPY requirements.txt ${WORKSPACE}/

    # Install dependencies in a single RUN statement to reduce layers
    RUN python3.11 -m pip install --no-cache-dir --upgrade pip && \
        python3.11 -m pip install --no-cache-dir -r ${WORKSPACE}/requirements.txt && \
        rm ${WORKSPACE}/requirements.txt

    # Copy source code to /runpod-volume/app
    COPY test_input.json ${WORKSPACE}/app/
    COPY src ${WORKSPACE}/app/src

    # Set the working directory
    WORKDIR ${WORKSPACE}/app/src

    # Use the built-in handler script from the source
    CMD ["python3.11", "-u", "runpod_handler.py"]
Zuck
Zuckβ€’13mo ago
@Sandeep @navin_hariharan Did you guys ever get this working? I'm trying to do the same thing with ai-toolkit and the Flux dev model. Any code you can share? There are some things in your Docker image @navin_hariharan I'd love to be able to edit. Thank you!! 😭😭
navin_hariharan
navin_hariharanβ€’13mo ago
@Zuck I have lost the Dockerfile of https://hub.docker.com/r/navinhariharan/flux-lora/tags
Zuck
Zuckβ€’13mo ago
That’s okay ! I should be able to reverse engineer πŸ™‚ Thank you so much!!
navin_hariharan
navin_hariharanβ€’13mo ago
Please send it here if you have managed to do it!
Zuck
Zuckβ€’13mo ago
Deal sounds good!
navin_hariharan
navin_hariharanβ€’13mo ago
@Zuck Are you free now?
navin_hariharan
navin_hariharanβ€’13mo ago
Give this a test! Should work hopefully!
Zuck
Zuckβ€’13mo ago
@navin_hariharan amazing, okay, thanks!! I uploaded the contents of the Docker image to a private GitHub repo. Did you want me to share it with you privately?
navin_hariharan
navin_hariharanβ€’13mo ago
Here is everything working! 🙂
navin_hariharan
navin_hariharanβ€’13mo ago
You can make it public! No issues! Many people may benefit! Removed unnecessary code:
- It's just the models that the FROM is pulling!
- The AI toolkit will now be downloaded in this Dockerfile!
TO-DO: Support the schnell config
Zuck
Zuckβ€’13mo ago
GitHub
GitHub - newideas99/flux-training-docker
Jesse
Jesseβ€’6mo ago
@Zuck @navin_hariharan I built a Docker image using this repo https://github.com/newideas99/flux-training-docker and successfully trained a LoRA using Runpod serverless endpoints. However, when I run the trained LoRA, I get this error: "Exception: Error while deserializing header: HeaderTooLarge." I am no expert, but the LoRA safetensors file might be corrupted, and the reason behind the corruption is the Docker base image "navinhariharan/fluxd-model". Any help is appreciated.
Best,
Jesse
navin_hariharan
navin_hariharanβ€’6mo ago
Can you please screenshot the error?
Jesse
Jesseβ€’6mo ago
Thanks for your quick reply. I am using the lora.safetensors file (uploaded to my S3 storage by the runpod-serverless.py handler) on Replicate.
(screenshot attached)
Jesse
Jesseβ€’6mo ago
@navin_hariharan I have tried to train multiple LoRAs, and I got the same errors. I tried to run this LoRA in ComfyUI too, and it gave me the same error.
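For what it's worth, HeaderTooLarge from safetensors usually means the file is not a valid safetensors at all, for example a truncated download or an error page saved under the .safetensors name: the first 8 bytes of a safetensors file are the little-endian length of its JSON header, so garbage there immediately produces this error. A quick sanity check, with a placeholder file name:

    # Hedged sanity check for a suspect .safetensors file: the first 8 bytes are the
    # little-endian length of the JSON header, so a corrupted or non-safetensors file
    # shows up as a nonsensical header size or invalid JSON.
    import json
    import struct

    def inspect_safetensors(path: str):
        with open(path, "rb") as f:
            header_len = struct.unpack("<Q", f.read(8))[0]
            print("declared header length:", header_len)
            header = json.loads(f.read(header_len))
        print("tensor count:", len([k for k in header if k != "__metadata__"]))

    inspect_safetensors("lora.safetensors")  # placeholder path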
navin_hariharan
navin_hariharanβ€’6mo ago
@Jesse Your request header is too large
Jesse
Jesseβ€’6mo ago
@navin_hariharan what does it mean?
navin_hariharan
navin_hariharanβ€’6mo ago
The request you sent has huge text/data! Can you send me the request JSON you sent? Please remove credentials if you have entered any.
Jesse
Jesseβ€’6mo ago
sure
Jesse
Jesseβ€’6mo ago
(screenshot attached)
Jesse
Jesseβ€’6mo ago
@navin_hariharan It would be a great help if you could provide the Dockerfile of this image as well: navinhariharan/fluxd-model. Thanks!
navin_hariharan
navin_hariharanβ€’6mo ago
Can I get a full screenshot of those logs? I'll need to have a look! I don't know where I have put it.
navin_hariharan
navin_hariharanβ€’6mo ago
GitHub
GitHub - navin-hariharan/FLUX-INFERENCE-LORA
Flux Inference with LoRA - runpod worker.
Jesse
Jesseβ€’6mo ago
Thank you so much Navin, I appreciate it. I'll provide you the logs from my desktop shortly. Thanks again.
Unknown User
Unknown Userβ€’6mo ago
Message Not Public
Jesse
Jesseβ€’6mo ago
    2025-06-02 00:55:17.380 | INFO    | fp8.lora_loading:restore_base_weights:600 - Unloaded 304 layers
    2025-06-02 00:55:17.382 | SUCCESS | fp8.lora_loading:unload_loras:571 - LoRAs unloaded in 0.0042s
    free=26730077900800
    Downloading weights
    downloading weights from https://lora-urls.co/xzy.safetensors
    Downloaded weights in 8.33s
    2025-06-02 00:55:25.713 | INFO    | fp8.lora_loading:convert_lora_weights:502 - Loading LoRA weights for /src/weights-cache/f14ea1f2c70aca45
    Traceback (most recent call last):
      File "/root/.pyenv/versions/3.11.12/lib/python3.11/site-packages/cog/server/worker.py", line 352, in _predict
        result = predict(payload)
                 ^^^^^^^^^^^^^^^^
      File "/src/predict.py", line 566, in predict
        model.handle_loras(
      File "/src/bfl_predictor.py", line 118, in handle_loras
        load_lora(model, lora_path, lora_scale, self.store_clones)
      File "/root/.pyenv/versions/3.11.12/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
        return func(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^
      File "/src/fp8/lora_loading.py", line 543, in load_lora
        lora_weights = convert_lora_weights(lora_path, has_guidance)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/src/fp8/lora_loading.py", line 503, in convert_lora_weights
        lora_weights = load_file(lora_path, device="cuda")
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/root/.pyenv/versions/3.11.12/lib/python3.11/site-packages/safetensors/torch.py", line 311, in load_file
        with safe_open(filename, framework="pt", device=device) as f:
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    safetensors_rust.SafetensorError: Error while deserializing header: HeaderTooLarge

@navin_hariharan I have pasted the logs from Replicate here.

@navin_hariharan I think the GitHub repo is related to ComfyUI, not to the "navinhariharan/fluxd-model" image, which I requested.

@Jason The trained LoRA was uploaded via the script [worker] to my S3 bucket, and I am loading it via URL into Replicate inference.
Unknown User
Unknown Userβ€’6mo ago
Message Not Public
Jesse
Jesseβ€’6mo ago
@ You mean the trained Lora is corrupted, right?
Unknown User
Unknown Userβ€’6mo ago
Message Not Public
Jesse
Jesseβ€’6mo ago
@Jason Thanks, I will check this out. @Jason I have verified that the downloading process doesn't make any difference to the file. The hashes match, so my Docker image is the culprit.
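The hash comparison Jesse describes can be done in a few lines with hashlib; a sketch with placeholder file names:

    # Hedged sketch of the hash comparison described above: compare the file the
    # worker uploaded with the file downloaded back from storage. File names are placeholders.
    import hashlib

    def sha256(path: str) -> str:
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    print(sha256("lora_local.safetensors") == sha256("lora_downloaded.safetensors"))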
Unknown User
Unknown Userβ€’6mo ago
Message Not Public
Jesse
Jesseβ€’6mo ago
No, it's not working anywhere I tried; on Replicate and in ComfyUI as well, both gave me the same error. I used the repo https://github.com/newideas99/flux-training-docker and tweaked it a bit for my use case. I think the issue lies in the base image "navinhariharan/fluxd-model", since the layer image doesn't hold anything related to the training process itself. I also tried to build an image from scratch, but that didn't work. 😥
Unknown User
Unknown Userβ€’6mo ago
Message Not Public
Jesse
Jesseβ€’6mo ago
The Flux files were corrupted, so I had to start from scratch, and it worked. Thanks @Jason @navin_hariharan for your help.
Unknown User
Unknown Userβ€’6mo ago
Message Not Public
navin_hariharan
navin_hariharanβ€’6mo ago
Anytime! Glad you got it to work 😁
