Runpod4mo ago
Simon

H100 Replicate VS RunPod

Hi! When I use a Flux model on Replicate, generating 4 images takes about 30 seconds at $0.001525 per second on an H100 (about $0.046 total). With RunPod, generating the same 4 images takes 60 seconds and costs a bit more. How can I achieve the same processing time and cost on RunPod as I do on Replicate? I prefer RunPod because I have more control over the workflow. Thanks!
45 Replies
Unknown User
Unknown User4mo ago
Message Not Public
Simon
SimonOP4mo ago
Actually, the workflow I created on RunPod is the same as on Replicate, but on RunPod it's very inconsistent: it can take anywhere from 45 to 60 seconds, with a per-second cost that's higher than on Replicate. But nothing is different; I'm just using a custom LoRA + Flux.dev. I'm going to try an L40S to see if that helps, thanks.
Unknown User
Unknown User4mo ago
Message Not Public
Simon
SimonOP4mo ago
I'm using https://github.com/runpod-workers/worker-comfyui. The model is already in the container; only the LoRAs are in network storage.
GitHub
GitHub - runpod-workers/worker-comfyui: ComfyUI as a serverless API...
ComfyUI as a serverless API on RunPod. Contribute to runpod-workers/worker-comfyui development by creating an account on GitHub.
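For reference, hitting a worker-comfyui endpoint is just a POST to the RunPod serverless API with the exported workflow JSON. A minimal sketch, assuming a hypothetical endpoint ID, a RUNPOD_API_KEY env var, and a workflow whose LoraLoader node points at a file on the network volume:

```python
# Minimal sketch of calling a worker-comfyui serverless endpoint.
# Assumptions (not from this thread): ENDPOINT_ID is your endpoint's ID,
# RUNPOD_API_KEY is set, and workflow_api.json is a ComfyUI workflow exported
# in API format whose LoraLoader node references a file the worker can see.
import os
import json
import requests

ENDPOINT_ID = "your-endpoint-id"  # hypothetical placeholder
API_KEY = os.environ["RUNPOD_API_KEY"]

with open("workflow_api.json") as f:
    workflow = json.load(f)

resp = requests.post(
    f"https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"input": {"workflow": workflow}},
    timeout=300,
)
resp.raise_for_status()
print(resp.json())
```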
flash-singh
flash-singh4mo ago
we plan to have public endpoints for flux very soon, will support flux dev and schnell. current times we get for 4090s is about 9 seconds per image right now. it's planned for end of this month, cost is about 50% of replicate
Simon
SimonOP4mo ago
but for flux.dev with a custom lora ?
Unknown User
Unknown User4mo ago
Message Not Public
flash-singh
flash-singh4mo ago
we don't plan to support custom loras yet, that's a maybe for the future
Unknown User
Unknown User4mo ago
Message Not Public
flash-singh
flash-singh4mo ago
the fastest is using H100s which is about 4 seconds but then cost goes up
Unknown User
Unknown User4mo ago
Message Not Public
flash-singh
flash-singh4mo ago
yes, we plan to eventually have pro, won't be end of this month. for now it's flux dev and schnell, others will come in july most likely; july is also when we plan to do video models
Unknown User
Unknown User4mo ago
Message Not Public
flash-singh
flash-singh4mo ago
if you had the choice of faster image gen for higher cost or 2x slower image gen with half the cost, which would you pick?
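Worth noting: at a fixed per-second price, per-image cost is just rate times seconds, so 2x slower at half the rate comes out to the same cost per image; the tradeoff is purely latency. A quick check, using the thread's numbers as illustrative inputs:

```python
# Per-image cost is (price per second) x (seconds per image), so halving the
# rate while doubling the time leaves cost unchanged; only latency differs.
# Numbers below are illustrative: $0.001525/s and ~4 s/image come from this
# thread, not from any price sheet.
def cost_per_image(price_per_sec: float, secs_per_image: float) -> float:
    return price_per_sec * secs_per_image

fast = cost_per_image(0.001525, 4)       # faster GPU at the full rate
slow = cost_per_image(0.001525 / 2, 8)   # half the rate, twice as slow
print(f"fast: ${fast:.6f}/image, slow: ${slow:.6f}/image")  # identical
```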
Unknown User
Unknown User4mo ago
Message Not Public
flash-singh
flash-singh4mo ago
sure thanks
Unknown User
Unknown User4mo ago
Message Not Public
flash-singh
flash-singh4mo ago
@Jason which flux models you plan to use?
Unknown User
Unknown User4mo ago
Message Not Public
flash-singh
flash-singh4mo ago
how fast is 1 image on replicate with flux dev?
Unknown User
Unknown User4mo ago
Message Not Public
flash-singh
flash-singh4mo ago
from what @Simon mentioned, should be about 7-8 seconds
Unknown User
Unknown User4mo ago
Message Not Public
flash-singh
flash-singh4mo ago
that's likely using go_fast, which uses flux schnell
We provide a go_fast flag within the API which toggles a version of flux-schnell optimized for inference. Currently this version is a compiled fp8 quantization with an optimized attention kernel. We’ll update the model and this documentation as we develop further enhancements.
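For comparison, toggling that flag is a one-line change with the replicate Python client; a minimal sketch (model name and the go_fast input as documented for flux-schnell, prompt is a placeholder):

```python
# Minimal sketch of toggling go_fast on Replicate's flux-schnell model.
# Assumes `pip install replicate` and a REPLICATE_API_TOKEN env var.
import replicate

output = replicate.run(
    "black-forest-labs/flux-schnell",
    input={
        "prompt": "a photo of a red fox in the snow",  # placeholder prompt
        "go_fast": True,  # the fp8-quantized, inference-optimized variant per the quote above
    },
)
print(output)
```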
Unknown User
Unknown User4mo ago
Message Not Public
flash-singh
flash-singh4mo ago
if they say it, otherwise no clue. so far the best times for flux dev on H100 are 3-4s; if someone can do it faster on an A100, i'll be skeptical
Unknown User
Unknown User4mo ago
Message Not Public
flash-singh
flash-singh4mo ago
schnell on H100 can do 1s for sure, dev is slower
Unknown User
Unknown User4mo ago
Message Not Public
flash-singh
flash-singh4mo ago
just tried fal, it's 2.89 seconds at 1024x1024, 28 steps
Unknown User
Unknown User4mo ago
Message Not Public
Simon
SimonOP4mo ago
On Replicate I train a model with ostris/flux-dev-trainer, then 1 image is like 9 seconds, 4 -> 30 seconds
Unknown User
Unknown User4mo ago
Message Not Public
Simon
SimonOP4mo ago
30 s * $0.001525/s ≈ $0.046. What I see in Replicate is:
Downloaded weights in 0.64s (very fast)
Loaded LoRAs in 2.02s (also fast, I think they have a cache)
and then 100%|██████████| 28/28 [00:08<00:00, 3.40it/s] with an H100
Unknown User
Unknown User4mo ago
Message Not Public
Simon
SimonOP4mo ago
yes it's very close
flash-singh
flash-singh4mo ago
how big is the lora?
Simon
SimonOP4mo ago
680mb
flash-singh
flash-singh4mo ago
will have to explore loras in future, is that for flux dev or schnell?
Simon
SimonOP4mo ago
Flux Dev. I'm spending $2,000 a month on Replicate; if I can get close to the same response time and price per second, I'll switch to RunPod.
flash-singh
flash-singh4mo ago
are you passing loras using s3 bucket url or some other way?
Unknown User
Unknown User4mo ago
Message Not Public
Simon
SimonOP4mo ago
yes, Replicate stores the LoRA directly on their servers. I'm using https://replicate.com/ostris/flux-dev-lora-trainer/train, and what I read is that they use "fast booting":
Fast booting fine-tunes

Sometimes, we’re able to optimize how a trained model is run so it boots fast. This works by using a common, shared pool of hardware running a base model. In these cases, we only ever charge you for the time the model is active and processing your requests, regardless of whether or not it’s public or private.

Fast booting fine-tunes are labeled as such in the model’s version list. You can also see which versions support the creation of fast booting models when training.
ostris/flux-dev-lora-trainer – Replicate
Fine-tune FLUX.1-dev using ai-toolkit
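The pattern behind that quote is roughly: the base model stays resident on a shared pool of workers, and only the small LoRA is fetched and applied per request, ideally cached so repeats skip the download. A minimal sketch of that caching step, assuming LoRAs arrive as URLs (e.g. presigned S3 links) and a hypothetical loras directory path inside the worker:

```python
# Minimal sketch of the fast-booting idea: the base model stays loaded on a
# shared worker; per request we only fetch the LoRA (e.g. via a presigned S3
# URL) and cache it locally so repeat requests skip the download entirely.
import hashlib
import os
import urllib.request

LORA_DIR = "/comfyui/models/loras"  # assumed path, adjust to your worker image

def ensure_lora(url: str) -> str:
    """Download the LoRA once; return the cached local filename."""
    os.makedirs(LORA_DIR, exist_ok=True)
    name = hashlib.sha256(url.encode()).hexdigest()[:16] + ".safetensors"
    path = os.path.join(LORA_DIR, name)
    if not os.path.exists(path):  # cache hit -> no download, fast "boot"
        urllib.request.urlretrieve(url, path)
    return name  # the workflow's LoraLoader node then references this name
```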
Jassim
Jassim2mo ago
is it not slower putting models in the docker image? i'm using network storage since i was told it's faster that way
Unknown User
Unknown User2mo ago
Message Not Public
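On Jassim's question: files baked into the image sit on the worker's local disk, while a network volume is remote storage, so raw read throughput often favors the image. The simplest way to settle it is to time both paths on your own worker; a rough sketch (both paths are assumptions, adjust to your setup, and note that a warm page cache will skew the numbers):

```python
# Quick-and-dirty read-throughput check: time streaming the same checkpoint
# from the container filesystem vs the network volume. Paths are assumptions;
# run on a cold worker for a fair comparison.
import time

def read_speed(path: str, chunk: int = 1 << 20) -> float:
    start, total = time.time(), 0
    with open(path, "rb") as f:
        while data := f.read(chunk):
            total += len(data)
    return total / (time.time() - start) / 1e6  # MB/s

for p in ("/comfyui/models/checkpoints/flux1-dev.safetensors",          # baked into image
          "/runpod-volume/models/checkpoints/flux1-dev.safetensors"):   # network volume
    print(p, f"{read_speed(p):.0f} MB/s")
```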
