Training Flux Schnell on serverless
Hi there, I am using your pods to run ostris/ai-toolkit to train Flux on custom images. Now I want to use your serverless endpoint capabilities. Can you help me out? Do you have some kind of template or guide on how to do it?
@Untrack4d Hii!
I have the dev serverless already! I'll update schnell soon
Do you have some demo or can I test it out?
Give me 30min
Ok man, thx
What are you using to train it?
{
    "input": {
        "lora_file_name": "laksheya-geraldine_viswanathan-FLUX",
        "trigger_word": "geraldine viswanathan",
        "gender": "woman",
        "data_url": "dataset_zip url"
    },
    "s3Config": {
        "accessId": "accessId",
        "accessSecret": "accessSecret",
        "bucketName": "flux-lora",
        "endpointUrl": "https://minio-api.cloud.com"
    }
}
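For reference, a request like this can be sent to the endpoint with RunPod's run API; a minimal sketch, assuming a hypothetical endpoint ID and API key (RunPod also accepts s3Config as a top-level field next to input):

import requests

# Hypothetical endpoint ID and API key; the payload mirrors the JSON above.
payload = {
    "input": {
        "lora_file_name": "laksheya-geraldine_viswanathan-FLUX",
        "trigger_word": "geraldine viswanathan",
        "gender": "woman",
        "data_url": "dataset_zip url",
    },
    "s3Config": {
        "accessId": "accessId",
        "accessSecret": "accessSecret",
        "bucketName": "flux-lora",
        "endpointUrl": "https://minio-api.cloud.com",
    },
}

resp = requests.post(
    "https://api.runpod.ai/v2/<ENDPOINT_ID>/run",
    headers={"Authorization": "Bearer <RUNPOD_API_KEY>"},
    json=payload,
)
print(resp.json())  # returns a job id; poll /status/<id> to track the run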
@Untrack4d
Thanks for sharing I will check it out
what does this image contain?
FROM navinhariharan/flux-lora:latest
How are you handling the long-running training process?
Disable this for long-running processes

FROM navinhariharan/flux-lora:latest
These contain the flux models dev and schnell
Thank you for the help!
Anytime!
So the lora is trained and sent to your s3 bucket!
I will be hosting it in a server of mine to reduce costs
I use minio!
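For the upload side, a minimal boto3 sketch, assuming the same s3Config fields as in the request above; the endpoint_url parameter is what makes boto3 talk to MinIO instead of AWS:

import boto3

# Values assumed to mirror the request's s3Config block
s3 = boto3.client(
    "s3",
    endpoint_url="https://minio-api.cloud.com",
    aws_access_key_id="accessId",
    aws_secret_access_key="accessSecret",
)
# Push the trained LoRA to the bucket once training finishes
s3.upload_file("output/lora.safetensors", "flux-lora", "lora.safetensors")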
Never heard of it
open source s3
MinIO
I will take a look
Sure! If you have issues let me know! I'll be happy to help!
Do you have any tips to get better results?
Or to make it train faster?
Sample dataset with default params works!
It takes 2 hours!
The one in the Civitai lora trainer is faster!
Here is a bmw I trained!
https://civitai.com/models/736216/bmw-m340i-2024-lci-flux?modelVersionId=823275
i was using ai-toolkit
what hardware are you using?

Does it work for schnell? Is it faster than ai-toolkit?
You can deploy this to get started!

Yes! Yes!
The lora size is small too without loss of quality!
navinhariharan/flux-lora:latest
With ai-toolkit I am getting about 30-40 min for 1000 steps
I do 2000 steps!
ok, that makes sense
are you doing some kind of image selection/preprocessing?
Yep! The captions!
I am using Florence-2 for that
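A minimal captioning sketch with Florence-2 via transformers, assuming the standard <CAPTION> task prompt from the model card (the image path is illustrative):

from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Florence-2-base", trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(
    "microsoft/Florence-2-base", trust_remote_code=True
)

image = Image.open("dataset/img_001.jpg")  # illustrative path
inputs = processor(text="<CAPTION>", images=image, return_tensors="pt")
generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=128,
)
raw = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
caption = processor.post_process_generation(
    raw, task="<CAPTION>", image_size=(image.width, image.height)
)
print(caption["<CAPTION>"])  # use as the training caption for this image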
You aren't excluding low-quality ones, resizing, etc.?
The images you mean?
I mix a bit of everything!
I have noticed that low-quality ones can completely mess up your output
What have you put in this image navinhariharan/flux-lora:latest? I want to customize it, can you share the source?
black-forest-labs/FLUX.1-schnell
black-forest-labs/FLUX.1-dev
These are auto-downloaded by ai-toolkit! Instead of exporting HF_TOKEN as an env var,
I downloaded them and made a docker image
They live here:
/huggingface/
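If you want to rebuild that layer yourself, a sketch of pre-downloading the weights with huggingface_hub at image build time (FLUX.1-dev is gated, so HF_TOKEN still has to be set once during the build):

from huggingface_hub import snapshot_download

# Pre-fetch both models into the image's HF cache so workers never
# download at runtime. FLUX.1-dev is gated and needs HF_TOKEN exported.
for repo in ("black-forest-labs/FLUX.1-schnell", "black-forest-labs/FLUX.1-dev"):
    snapshot_download(repo_id=repo)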
I want to store those models in a network volume, so they can be shared between serverless instances
That's the best!
The thing is, I didn't understand how to choose where it's stored
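The storage location is controlled by the Hugging Face cache env vars; a sketch, assuming the network volume mounts at /runpod-volume as it does on RunPod serverless:

import os

# Point the HF cache at the network volume so every worker shares one copy.
# These must be set before transformers/diffusers/ai-toolkit are imported.
os.environ["HF_HOME"] = "/runpod-volume/huggingface-cache"
os.environ["HUGGINGFACE_HUB_CACHE"] = "/runpod-volume/huggingface-cache/hub"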
another thing:
def train_lora(job):
    # RunPod passes the whole request json as `job`
    if 's3Config' in job:
        s3_config = job["s3Config"]
    job_input = job["input"]
    job_input = download(job_input)  # worker helper: fetch + unzip the dataset
    if edityaml(job_input):  # worker helper: write params into the yaml config
        if job_input['gender'].lower() in ['woman', 'female', 'girl']:
            job = get_job('config/woman.yaml', None)
        elif job_input['gender'].lower() in ['man', 'male', 'boy']:
            job = get_job('config/man.yaml', None)
        job.run()
how are you able to run the job, where does the get_job function come from?
The handler bro!
Yes but then you call job.run
runpod.serverless.start({"handler": train_lora})
This will call the function train_lora with the input json! That is...
job = {
    "input": {
        "lora_file_name": "laksheya-geraldine_viswanathan-FLUX",
        "trigger_word": "geraldine viswanathan",
        "gender": "woman",
        "data_url": "dataset_zip url"
    },
    "s3Config": {
        "accessId": "accessId",
        "accessSecret": "accessSecret",
        "bucketName": "flux-lora",
        "endpointUrl": "https://minio-api.cloud.com"
    }
}
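For local testing before deploying: if I remember right, running the handler file outside serverless makes the runpod SDK feed it a test_input.json from the working directory instead of waiting for endpoint requests. A toy sketch, with the job body above saved to that file:

# handler.py -- `python handler.py` locally picks up test_input.json
import runpod

def train_lora(job):
    print(job["input"]["trigger_word"])  # "geraldine viswanathan"
    return {"status": "ok"}  # toy return for the local test

runpod.serverless.start({"handler": train_lora})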
@Untrack4d
And where is that function?
The train_lora ?
@Untrack4d Line 31
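For anyone else wondering: get_job is ai-toolkit's entry point. Assuming the stock ostris/ai-toolkit layout, it comes from toolkit/job.py:

# Assuming the stock ostris/ai-toolkit layout
from toolkit.job import get_job

job = get_job('config/woman.yaml', None)  # parse the yaml into a job object
job.run()  # runs the actual training loop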

Sorry man, it was a pretty stupid question, that's what I get for trying to do n things at a time ahaha
No issues man! We are all learning!
Have you managed to successfully use network volumes in serverless?
I've never tried them! It shouldn't be difficult though
Is this due to the container size?
And may I know what inference time it's taking for an image to generate on an A100 or any other GPUs? For me it's taking 15 seconds.
@navin_hariharan
@Sandeep what is your input?
Please remove any credentials you have and send
Looks like an error while downloading dataset
I am using flux and sdxl models in this deployment.
Whenever a user sends a flux lora request, I will generate with the flux lora.
Same applies to sdxl.
Input is:
Lora blob url
Modeltype
What should be the container size?
That's all fine!
How are you sending in the training dataset?
@Sandeep
This system doesn't need datasets; it just uses the models from huggingface. It will import the models from huggingface, download the lora, and use that lora for inference.
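A rough sketch of that flow with diffusers; the model ID, prompt, and download helper are illustrative, and the sdxl branch would look the same with StableDiffusionXLPipeline:

import requests
import torch
from diffusers import FluxPipeline

def fetch_lora(url, path="/tmp/lora.safetensors"):
    # Illustrative helper: pull the LoRA from the blob url in the request
    with open(path, "wb") as f:
        f.write(requests.get(url, timeout=60).content)
    return path

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")
pipe.load_lora_weights(fetch_lora("https://example.com/lora.safetensors"))
image = pipe("a photo of the subject", num_inference_steps=28).images[0]
image.save("out.png")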
Could you please send the worker files so that I can take a look?
And also do not forget to remove sensitive info before sending!
getting this error when I am using runpod-volume

# Use a more specific base image for efficiency
FROM runpod/base:0.6.2-cuda12.2.0

# Set environment variables
ENV HF_HUB_ENABLE_HF_TRANSFER=0 \
    PYTHONDONTWRITEBYTECODE=1 \
    PYTHONUNBUFFERED=1 \
    HF_HOME=/runpod-volume/huggingface-cache \
    HUGGINGFACE_HUB_CACHE=/runpod-volume/huggingface-cache/hub \
    WORKSPACE=/runpod-volume

RUN ls -a /

# Create necessary directories
RUN mkdir -p ${WORKSPACE}/app ${HF_HOME}

# Copy requirements first to leverage Docker cache for dependencies
COPY requirements.txt ${WORKSPACE}/

# Install dependencies in a single RUN statement to reduce layers
RUN python3.11 -m pip install --no-cache-dir --upgrade pip && \
    python3.11 -m pip install --no-cache-dir -r ${WORKSPACE}/requirements.txt && \
    rm ${WORKSPACE}/requirements.txt

# Copy source code to /runpod-volume/app
COPY test_input.json ${WORKSPACE}/app/
COPY src ${WORKSPACE}/app/src

# Set the working directory
WORKDIR ${WORKSPACE}/app/src

# Use the built-in handler script from the source
CMD ["python3.11", "-u", "runpod_handler.py"]
@Sandeep @navin_hariharan
Did you guys ever get this working? I'm trying to do the same thing with ai-toolkit. Flux dev model.
Any code you can share? There are some things in your docker image @navin_hariharan I'd love to be able to edit
Thank you!!
@Zuck
I have lost the Dockerfile of
https://hub.docker.com/r/navinhariharan/flux-lora/tags
That's okay! I should be able to reverse engineer it!
Thank you so much!!
Please send it here if you have managed to do it!
Deal sounds good!
@Zuck Are you free now?
Give this a test! Should work hopefully!
@navin_hariharan amazing okay thanks!!
I uploaded the contents of the docker image to a private github, did you want me to share it with you privately?
Here is everything working!
You can make it public! No issues! Many people might benefit!
Removed unnecessary code!
- It's just the models that the FROM is pulling!
- AI toolkit will now be downloaded in this Dockerfile!
TO-DO:
Support the schnell config
https://github.com/newideas99/flux-training-docker
@Zuck @navin_hariharan I built a Docker image using this repo https://github.com/newideas99/flux-training-docker and successfully trained a LoRA using RunPod serverless endpoints. However, when I run the trained LoRA, I get this error: "Exception: Error while deserializing header: HeaderTooLarge." I am no expert, but the LoRA safetensors file might be corrupted, and the reason behind the corruption is the Docker base image "navinhariharan/fluxd-model."
Any help is appreciated.
Best,
Jesse
Can you please screenshot the error?
Thanks for your quick reply. I am using the lora.safetensors file (uploaded to my S3 storage by the runpod-serverless.py handler) on Replicate.

@navin_hariharan I have tried to train multiple LoRas, and I got the same errors.
I tried to run this lora in ComfyUI too, and it gave me the same error
@Jesse Your request header is too large
@navin_hariharan what does it mean?
The request you sent has huge text/data!
Can you send me the request json you sent? Please remove credentials if you have entered any
sure

@navin_hariharan It would be a great help if you could provide the Dockerfile of this image as well: navinhariharan/fluxd-model
Thanks
Can I get a full screenshot of these logs?
I'll need to have a look! Idk where I have put it
GitHub - navin-hariharan/FLUX-INFERENCE-LORA: Flux Inference with LoRA (runpod worker)
https://github.com/navin-hariharan/FLUX-INFERENCE-LORA
Thank you so much navin, I appreciate it. I'll provide you the logs from my desktop shortly, thanks again
2025-06-02 00:55:17.380 | INFO | fp8.lora_loading:restore_base_weights:600 - Unloaded 304 layers
2025-06-02 00:55:17.382 | SUCCESS | fp8.lora_loading:unload_loras:571 - LoRAs unloaded in 0.0042s
free=26730077900800
Downloading weights
downloading weights from https://lora-urls.co/xzy.safetensors
Downloaded weights in 8.33s
2025-06-02 00:55:25.713 | INFO | fp8.lora_loading:convert_lora_weights:502 - Loading LoRA weights for /src/weights-cache/f14ea1f2c70aca45
Traceback (most recent call last):
  File "/root/.pyenv/versions/3.11.12/lib/python3.11/site-packages/cog/server/worker.py", line 352, in _predict
    result = predict(payload)
             ^^^^^^^^^^^^^^^^
  File "/src/predict.py", line 566, in predict
    model.handle_loras(
  File "/src/bfl_predictor.py", line 118, in handle_loras
    load_lora(model, lora_path, lora_scale, self.store_clones)
  File "/root/.pyenv/versions/3.11.12/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/src/fp8/lora_loading.py", line 543, in load_lora
    lora_weights = convert_lora_weights(lora_path, has_guidance)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/src/fp8/lora_loading.py", line 503, in convert_lora_weights
    lora_weights = load_file(lora_path, device="cuda")
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.pyenv/versions/3.11.12/lib/python3.11/site-packages/safetensors/torch.py", line 311, in load_file
    with safe_open(filename, framework="pt", device=device) as f:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
safetensors_rust.SafetensorError: Error while deserializing header: HeaderTooLarge
@navin_hariharan I have pasted the logs from Replicate here.
@navin_hariharan I think the GitHub repo is related to ComfyUI, not to the "navinhariharan/fluxd-model" image, which I requested.
@Jason The trained LoRA was uploaded via the worker script to my S3 bucket, and I am loading it via URL into Replicate inference.
@ You mean the trained Lora is corrupted, right?
@Jason thanks, I will check this out.
@Jason I have verified and found that the downloading process doesn't make any difference to the file.
Hashes match, so my Docker image is the culprit
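One quick way to check the file itself, since HeaderTooLarge comes from the safetensors parser, not from HTTP: the first 8 bytes of a .safetensors file are a little-endian length of the JSON header, so corruption shows up immediately. A small diagnostic sketch:

import json
import struct

def inspect_safetensors(path):
    # safetensors layout: u64 LE header size, then that many bytes of JSON.
    # "HeaderTooLarge" usually means these bytes are garbage, e.g. the file
    # is truncated, corrupted, or actually an HTML error page.
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        print(f"declared header size: {header_len} bytes")
        header = json.loads(f.read(header_len))
        print(f"entries in header: {len(header)}")

inspect_safetensors("lora.safetensors")  # illustrative path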
No, it's not working anywhere I tried, on Replicate and in ComfyUI as well, and both gave me the same error.
I used the repo and tweaked it a bit for my use case. I think the issue lies in the base image "navinhariharan/fluxd-model", since the layer image doesn't hold anything related to the training process itself:
https://github.com/newideas99/flux-training-docker
I also tried to build an image from scratch, but that didn't work.
The flux files were corrupted, so I had to start from scratch, and it worked. Thanks @Jason @navin_hariharan for your help
Anytime! Glad you got it to work!