How Do You Speed Up ComfyUI Serverless?
Hi community!
I'm starting this thread to gather our collective knowledge on optimizing ComfyUI on RunPod Serverless. My goal is for us to share best practices and solve a tricky performance issue I'm facing.
Step 1: The Initial Problem (NORMAL_VRAM mode)
I started by checking my logs on both an A100 (80GB) and an L4 (24GB) worker. I noticed both were defaulting to NORMAL_VRAM mode, which seems suboptimal.
--- L4 ---
Total VRAM 22478 MB, total RAM 515498 MB
Set vram state to: NORMAL_VRAM
Device: cuda:0 NVIDIA L4 : cudaMallocAsync
ComfyUI version: 0.3.43
--- A100 ---
Total VRAM 81038 MB, total RAM 2051931 MB
Set vram state to: NORMAL_VRAM
Device: cuda:0 NVIDIA A100 80GB PCIe : cudaMallocAsync
ComfyUI version: 0.3.43
Step 2: The Attempted Fix and the New Issues
My first action was to add the --highvram flag to my launch command. This worked, and the logs now correctly show Set vram state to: HIGH_VRAM.
However, this is where I'm stuck and need your help. Despite being in HIGH_VRAM mode, the performance is still poor, and new issues have appeared:
The CPU usage is constantly stuck at 100%.
On the smaller 24GB L4, my workflow now fails with an OOM (Out Of Memory) error from the UNETLoader.
GPU utilization (on both the A100 and the L4) sits at only 30-40% during the entire job.
This makes me suspect that other launch arguments I'm using (like --gpu-only) are conflicting with --highvram and preventing the work from actually running on the GPU.
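For context, my launch line currently looks roughly like this (host, port, and paths are placeholders; the flag combination is the point):
python main.py --listen 0.0.0.0 --port 8188 --highvram --gpu-only
# --gpu-only also keeps the text encoders and VAE on the GPU, which may be why the 24GB L4 OOMs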
Let's Turn This Into a Knowledge-Sharing Thread!
To help solve this and create a resource for everyone, would you be willing to share the launch settings you use to run ComfyUI effectively?
I'm especially interested in:
Your full launch command from your start.sh or worker file.
The type of GPU you're running on.
Any key flags you've found essential for good performance (e.g., --preview-method auto, --disable-xformers, etc.).
Any other "secret sauce" for reducing cold start times or speeding up inference.
Thanks for sharing your expertise!
11 Replies
Here is a text file that lists the different arguments for starting ComfyUI.
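(And if that list ever goes stale, the same information can be printed from a standard install with ComfyUI's built-in help:)
python main.py --help    # prints every launch argument with a short description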
ComfyUI... is slow. If you want fast, write your own pipeline worker.
Super simple workflow with:
flux1-dev-fp8.safetensors
flux1-dev-vae.safetensors
clip_l.safetensors
t5xxl_fp16.safetensors
Max Workers: 1
Active Workers (Up to ~30% off): 0
GPU Count: 1
Idle Timeout: 5s
Execution Timeout: 600s
Enable Flashboot: False
Delay Time: 7.46s
Execution Time: 2m 9s
Yeah, I had about the same... 90 sec render time... with my own pipeline I average 20 sec on an A40.
Don't you think 2m 9s is a long time, compared to your 90s or 20s?
Thanks for the link. I'm reading the part about accelerating inference and reducing memory, and I have to say it's very interesting and well written.
Yeah, not to mention expensive as hell! There's no easy answer though. I got annoyed using ComfyUI so I built a better solution.
But there are a few flags that help... there's a -- argument for not loading the model on startup, which helps with the UI.
Umm, aside from that, yeah... use accelerate... try to split your LoRAs off the main GPU (allocate 2); there's a node pack called MultiGPU that will help.
But in general ComfyUI is buggy, heavy, UI-oriented software... CLI mode works, but IYKYK. Trust me, if you want to run at any real volume and can't afford dedicated hardware, there are better options. I think someone has even written a Flux wrapper for InvokeAI if all else fails.
A dedicated workflow pipeline written in Python will always be superior.
Personally, I still like ComfyUI because of its flexibility, and it's always the first to get support for new methods via nodes you can play around with.
I run very complex workflows with masking, object detection, and several passes. Sure, I could strip all the code out of the custom_nodes and make my own pipeline, but I like having the flexibility to change my JSON workflow without needing to rebuild my image.
I've made great strides in speeding up ComfyUI for my use case. I'm primarily focused on Wan 2.1, and a single img2vid job on an H100 at 1280x720 with 5 seconds of output takes me about 2 minutes. That's roughly 14 cents per video, including interpolation to 30 fps.
I use optimizations like Torch Compile, Sage Attention (Triton), cuBLAS, FP16 accumulation, and LoRAs that only need 8 sampling steps.
Try implementing Torch Compile and Sage Attention first... those are absolute musts.
Without these optimizations the same job takes the worker around 3.5 minutes.
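On the launch-flag side, assuming a reasonably recent ComfyUI build, enabling those looks roughly like this (Torch Compile itself is applied inside the workflow via a TorchCompileModel-style node rather than a flag):
python main.py --use-sage-attention --fast
# --use-sage-attention needs the sageattention (Triton) package installed in the image
# --fast enables extra speed tweaks such as FP16 accumulation on supported GPUs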
Also make sure you package your models in the Docker image and don't use a network volume. Mounted network storage introduces I/O lag, which will greatly increase your cold-start and FlashBoot times.
Anyone have thoughts on network storage vs keeping it all in the docker container? Any other ideas for optimization of serverless in general?
I package everything in the docker container.
I've tried keeping it on a network volume but the IO latency is too much.
Every job had a cold start of a minute to a minute and a half just from loading models.
The moment I started packaging all my models is when I saw a major speed gain.
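Concretely, that just means pulling the weights in at image build time instead of reading them off a volume at runtime. Sketched out (the URL and paths below are placeholders), the build step is something like:
# run during the Docker image build (e.g. inside a RUN step); placeholder URL and paths
wget -O /comfyui/models/checkpoints/flux1-dev-fp8.safetensors \
  https://example.com/path/to/flux1-dev-fp8.safetensors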