Training Flux Schnell on serverless
Hi there, I am using your pods to run ostris/ai-toolkit to train Flux on custom images. Now I want to use your serverless endpoint capabilities. Can you help me out? Do you have some kind of template or guide on how to do it?
@Untrack4d Hii!
I have the dev serverless already! I'll update schnell soon
Do you have some demo or can I test it out?
Give me 30min
Ok man, thx
What are you using to train it?
{
    "input": {
        "lora_file_name": "laksheya-geraldine_viswanathan-FLUX",
        "trigger_word": "geraldine viswanathan",
        "gender": "woman",
        "data_url": "dataset_zip url"
    },
    "s3Config": {
        "accessId": "accessId",
        "accessSecret": "accessSecret",
        "bucketName": "flux-lora",
        "endpointUrl": "https://minio-api.cloud.com"
    }
}
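For reference, a request like this can be sent to the endpoint with RunPod's run API; a minimal sketch, assuming a hypothetical endpoint ID and API key (RunPod also accepts s3Config as a top-level field next to input):

import requests

# Hypothetical endpoint ID and API key; the payload mirrors the JSON above.
payload = {
    "input": {
        "lora_file_name": "laksheya-geraldine_viswanathan-FLUX",
        "trigger_word": "geraldine viswanathan",
        "gender": "woman",
        "data_url": "dataset_zip url",
    },
    "s3Config": {
        "accessId": "accessId",
        "accessSecret": "accessSecret",
        "bucketName": "flux-lora",
        "endpointUrl": "https://minio-api.cloud.com",
    },
}

resp = requests.post(
    "https://api.runpod.ai/v2/<ENDPOINT_ID>/run",
    headers={"Authorization": "Bearer <RUNPOD_API_KEY>"},
    json=payload,
)
print(resp.json())  # returns a job id; poll /status/<id> to track the run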
@Untrack4d
Thanks for sharing I will check it out
what does this image contain?
FROM navinhariharan/flux-lora:latest
How are you handling the long-running training process?
Disable this for long-running processes

FROM navinhariharan/flux-lora:latest
These contain the flux models dev and schnell
Thank you for the help!
Anytime!
So the lora is trained and sent to your s3 bucket!
I will be hosting it in a server of mine to reduce costs
I use minio!
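For the upload side, a minimal boto3 sketch, assuming the same s3Config fields as in the request above; the endpoint_url parameter is what makes boto3 talk to MinIO instead of AWS:

import boto3

# Values assumed to mirror the request's s3Config block
s3 = boto3.client(
    "s3",
    endpoint_url="https://minio-api.cloud.com",
    aws_access_key_id="accessId",
    aws_secret_access_key="accessSecret",
)
# Push the trained LoRA to the bucket once training finishes
s3.upload_file("output/lora.safetensors", "flux-lora", "lora.safetensors")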
Never heard of it
open source s3
MinIO
I will take a look
Sure! If you have issues let me know! I'll be happy to help!
Do you have any tips to get better results?
Or to make it train faster?
Sample dataset with default params works!
It takes 2 hours!
The one in the Civitai lora trainer is faster!
Here is a bmw I trained!
https://civitai.com/models/736216/bmw-m340i-2024-lci-flux?modelVersionId=823275
i was using ai-toolkit
what hardware are you using?

Does it work for schnell? Is it faster than ai-toolkit?
You can deploy this to get started!

Yes! Yes!
The lora size is small too without loss of quality!
navinhariharan/flux-lora:latest
With ai-toolkit I am getting about 30-40 min for 1000 steps
I do 2000 steps!
ok, that makes sense
are you doing some kind of image selection/preprocessing?
Yep! The captions!
I am using Florence-2 for that
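A minimal captioning sketch with Florence-2 via transformers, assuming the standard <CAPTION> task prompt from the model card (the image path is illustrative):

from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Florence-2-base", trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(
    "microsoft/Florence-2-base", trust_remote_code=True
)

image = Image.open("dataset/img_001.jpg")  # illustrative path
inputs = processor(text="<CAPTION>", images=image, return_tensors="pt")
generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=128,
)
raw = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
caption = processor.post_process_generation(
    raw, task="<CAPTION>", image_size=(image.width, image.height)
)
print(caption["<CAPTION>"])  # use as the training caption for this image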
You aren't excluding low-quality ones, resizing, etc.?
The images you mean?
I mix a bit of everything!
I have noticed that low-quality ones can completely mess up your output
What have you put in this image navinhariharan/flux-lora:latest? I want to customize it, can you share the source?
black-forest-labs/FLUX.1-schnell
black-forest-labs/FLUX.1-dev
These are auto-downloaded by ai-toolkit! Instead of exporting HF_TOKEN as an env var,
I downloaded them and made a docker image
They live here:
/huggingface/
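If you want to rebuild that layer yourself, a sketch of pre-downloading the weights with huggingface_hub at image build time (FLUX.1-dev is gated, so HF_TOKEN still has to be set once during the build):

from huggingface_hub import snapshot_download

# Pre-fetch both models into the image's HF cache so workers never
# download at runtime. FLUX.1-dev is gated and needs HF_TOKEN exported.
for repo in ("black-forest-labs/FLUX.1-schnell", "black-forest-labs/FLUX.1-dev"):
    snapshot_download(repo_id=repo)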
I want to store those models in a network volume, so they can be shared between serverless instances
That's the best!
The thing is, I didn't understand how to choose where it's stored
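The storage location is controlled by the Hugging Face cache env vars; a sketch, assuming the network volume mounts at /runpod-volume as it does on RunPod serverless:

import os

# Point the HF cache at the network volume so every worker shares one copy.
# These must be set before transformers/diffusers/ai-toolkit are imported.
os.environ["HF_HOME"] = "/runpod-volume/huggingface-cache"
os.environ["HUGGINGFACE_HUB_CACHE"] = "/runpod-volume/huggingface-cache/hub"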
another thing:
def train_lora(job):
    # RunPod passes the whole request json as `job`
    if 's3Config' in job:
        s3_config = job["s3Config"]
    job_input = job["input"]
    job_input = download(job_input)  # worker helper: fetch + unzip the dataset
    if edityaml(job_input):  # worker helper: write params into the yaml config
        if job_input['gender'].lower() in ['woman', 'female', 'girl']:
            job = get_job('config/woman.yaml', None)
        elif job_input['gender'].lower() in ['man', 'male', 'boy']:
            job = get_job('config/man.yaml', None)
        job.run()
how are you able to run the job, where does the get_job function come from?
The handler bro!
Yes but then you call job.run
runpod.serverless.start({"handler": train_lora})
This will call the function train_lora with the input json! That is...
job = {
    "input": {
        "lora_file_name": "laksheya-geraldine_viswanathan-FLUX",
        "trigger_word": "geraldine viswanathan",
        "gender": "woman",
        "data_url": "dataset_zip url"
    },
    "s3Config": {
        "accessId": "accessId",
        "accessSecret": "accessSecret",
        "bucketName": "flux-lora",
        "endpointUrl": "https://minio-api.cloud.com"
    }
}
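For local testing before deploying: if I remember right, running the handler file outside serverless makes the runpod SDK feed it a test_input.json from the working directory instead of waiting for endpoint requests. A toy sketch, with the job body above saved to that file:

# handler.py -- `python handler.py` locally picks up test_input.json
import runpod

def train_lora(job):
    print(job["input"]["trigger_word"])  # "geraldine viswanathan"
    return {"status": "ok"}  # toy return for the local test

runpod.serverless.start({"handler": train_lora})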
@Untrack4d
And where is that function?
The train_lora ?
@Untrack4d Line 31
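For anyone else wondering: get_job is ai-toolkit's entry point. Assuming the stock ostris/ai-toolkit layout, it comes from toolkit/job.py:

# Assuming the stock ostris/ai-toolkit layout
from toolkit.job import get_job

job = get_job('config/woman.yaml', None)  # parse the yaml into a job object
job.run()  # runs the actual training loop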

Sorry man, it was a pretty stupid question, that's what I get for trying to do n things at a time ahaha
No issues man! We are all learning!
Have you managed to successfully use network volumes in serverless?
I've never tried them! It shouldn't be difficult though
Is this due to the container size?
And may I know what inference time it's taking for an image to generate on an A100 or any other GPUs? For me it's taking 15 seconds.
@navin_hariharan
@Sandeep what is your input?
Please remove any credentials you have and send
Looks like an error while downloading dataset
I am using flux and sdxl models in this deployment.
Whenever a user sends a flux lora request, I will generate with the flux lora.
Same applies to sdxl.
Input is:
Lora blob url
Modeltype
What should be the container size?
That's all fine!
How are you sending in the training dataset?
@Sandeep
This system doesn't need datasets; it just uses the models from huggingface. It will import the models from huggingface, download the lora, and use that lora for inference.
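A rough sketch of that flow with diffusers; the model ID, prompt, and download helper are illustrative, and the sdxl branch would look the same with StableDiffusionXLPipeline:

import requests
import torch
from diffusers import FluxPipeline

def fetch_lora(url, path="/tmp/lora.safetensors"):
    # Illustrative helper: pull the LoRA from the blob url in the request
    with open(path, "wb") as f:
        f.write(requests.get(url, timeout=60).content)
    return path

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")
pipe.load_lora_weights(fetch_lora("https://example.com/lora.safetensors"))
image = pipe("a photo of the subject", num_inference_steps=28).images[0]
image.save("out.png")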
Could you please send the worker files so that I can take a look?
And also do not forget to remove sensitive info before sending!
getting this error when I am using runpod-volume

# Use a more specific base image for efficiency
FROM runpod/base:0.6.2-cuda12.2.0

# Set environment variables
ENV HF_HUB_ENABLE_HF_TRANSFER=0 \
    PYTHONDONTWRITEBYTECODE=1 \
    PYTHONUNBUFFERED=1 \
    HF_HOME=/runpod-volume/huggingface-cache \
    HUGGINGFACE_HUB_CACHE=/runpod-volume/huggingface-cache/hub \
    WORKSPACE=/runpod-volume

RUN ls -a /

# Create necessary directories
RUN mkdir -p ${WORKSPACE}/app ${HF_HOME}

# Copy requirements first to leverage Docker cache for dependencies
COPY requirements.txt ${WORKSPACE}/

# Install dependencies in a single RUN statement to reduce layers
RUN python3.11 -m pip install --no-cache-dir --upgrade pip && \
    python3.11 -m pip install --no-cache-dir -r ${WORKSPACE}/requirements.txt && \
    rm ${WORKSPACE}/requirements.txt

# Copy source code to /runpod-volume/app
COPY test_input.json ${WORKSPACE}/app/
COPY src ${WORKSPACE}/app/src

# Set the working directory
WORKDIR ${WORKSPACE}/app/src

# Use the built-in handler script from the source
CMD ["python3.11", "-u", "runpod_handler.py"]
@Sandeep @navin_hariharan
Did you guys ever get this working? I'm trying to do the same thing with ai-toolkit. Flux dev model.
Any code you can share? There are some things in your docker image @navin_hariharan I'd love to be able to edit
Thank you!!
@Zuck
I have lost the Dockerfile of
https://hub.docker.com/r/navinhariharan/flux-lora/tags
That's okay! I should be able to reverse engineer it!
Thank you so much!!
Please send it here if you have managed to do it!
Deal sounds good!
@Zuck Are you free now?
Give this a test! Should work hopefully!
@navin_hariharan amazing okay thanks!!
I uploaded the contents of the docker image to a private github, did you want me to share it with you privately?
Here is everything working!
You can make it public! No issues! Many people might benefit!
Removed unnecessary code!
- It's just the models that the FROM is pulling!
- AI toolkit will now be downloaded in this Dockerfile!
TO-DO:
Support the schnell config
https://github.com/newideas99/flux-training-docker
@Zuck @navin_hariharan I built a Docker image using this repo https://github.com/newideas99/flux-training-docker and successfully trained a LoRA using RunPod serverless endpoints. However, when I run the trained LoRA, I get this error: "Exception: Error while deserializing header: HeaderTooLarge." I am no expert, but the LoRA safetensors file might be corrupted, and the reason behind the corruption is the Docker base image "navinhariharan/fluxd-model."
Any help is appreciated.
Best,
Jesse
Can you please screenshot the error?
Thanks for your quick reply. I am using the lora.safetensors file (uploaded to my S3 storage by the runpod-serverless.py handler) on Replicate.

@navin_hariharan I have tried to train multiple LoRas, and I got the same errors.
I tried to run this lora in ComfyUI too, and it gave me the same error
@Jesse Your request header is too large
@navin_hariharan what does it mean?
The request you sent has huge text/data!
Can you send me the request json you sent? Please remove credentials if you have entered any
sure

@navin_hariharan It would be a great help if you could provide the Dockerfile of this image as well: navinhariharan/fluxd-model
Thanks
Can I get a full screenshot of these logs?
I'll need to have a look! Idk where I have put it
GitHub - navin-hariharan/FLUX-INFERENCE-LORA: Flux Inference with LoRA (runpod worker)
https://github.com/navin-hariharan/FLUX-INFERENCE-LORA
Thank you so much navin, I appreciate it. I'll provide you the logs from my desktop shortly, thanks again
2025-06-02 00:55:17.380 | INFO | fp8.lora_loading:restore_base_weights:600 - Unloaded 304 layers
2025-06-02 00:55:17.382 | SUCCESS | fp8.lora_loading:unload_loras:571 - LoRAs unloaded in 0.0042s
free=26730077900800
Downloading weights
downloading weights from https://lora-urls.co/xzy.safetensors
Downloaded weights in 8.33s
2025-06-02 00:55:25.713 | INFO | fp8.lora_loading:convert_lora_weights:502 - Loading LoRA weights for /src/weights-cache/f14ea1f2c70aca45
Traceback (most recent call last):
  File "/root/.pyenv/versions/3.11.12/lib/python3.11/site-packages/cog/server/worker.py", line 352, in _predict
    result = predict(payload)
             ^^^^^^^^^^^^^^^^
  File "/src/predict.py", line 566, in predict
    model.handle_loras(
  File "/src/bfl_predictor.py", line 118, in handle_loras
    load_lora(model, lora_path, lora_scale, self.store_clones)
  File "/root/.pyenv/versions/3.11.12/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/src/fp8/lora_loading.py", line 543, in load_lora
    lora_weights = convert_lora_weights(lora_path, has_guidance)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/src/fp8/lora_loading.py", line 503, in convert_lora_weights
    lora_weights = load_file(lora_path, device="cuda")
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.pyenv/versions/3.11.12/lib/python3.11/site-packages/safetensors/torch.py", line 311, in load_file
    with safe_open(filename, framework="pt", device=device) as f:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
safetensors_rust.SafetensorError: Error while deserializing header: HeaderTooLarge
@navin_hariharan I have pasted the logs from Replicate here.
@navin_hariharan I think the GitHub repo is related to ComfyUI, not to the "navinhariharan/fluxd-model" image, which I requested.
@Jason The trained LoRA was uploaded via the worker script to my S3 bucket, and I am loading it via URL into Replicate inference.
@ You mean the trained Lora is corrupted, right?
@Jason thanks, I will check this out.
@Jason I have verified and found that the downloading process doesn't make any difference to the file.
Hashes match, so my Docker image is the culprit
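One quick way to check the file itself, since HeaderTooLarge comes from the safetensors parser, not from HTTP: the first 8 bytes of a .safetensors file are a little-endian length of the JSON header, so corruption shows up immediately. A small diagnostic sketch:

import json
import struct

def inspect_safetensors(path):
    # safetensors layout: u64 LE header size, then that many bytes of JSON.
    # "HeaderTooLarge" usually means these bytes are garbage, e.g. the file
    # is truncated, corrupted, or actually an HTML error page.
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        print(f"declared header size: {header_len} bytes")
        header = json.loads(f.read(header_len))
        print(f"entries in header: {len(header)}")

inspect_safetensors("lora.safetensors")  # illustrative path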
No, it's not working anywhere I tried, on Replicate and in ComfyUI as well, and both gave me the same error.
I used the repo and tweaked it a bit for my use case. I think the issue lies in the base image "navinhariharan/fluxd-model", since the layer image doesn't hold anything related to the training process itself:
https://github.com/newideas99/flux-training-docker
I also tried to build an image from scratch, but that didn't work.
The flux files were corrupted, so I had to start from scratch, and it worked. Thanks @Jason @navin_hariharan for your help
Anytime! Glad you got it to work!