Runpod · 2mo ago
pave7946

How Do You Speed Up ComfyUI Serverless?

Hi community! I'm starting this thread to gather our collective knowledge on optimizing ComfyUI on RunPod Serverless. My goal is for us to share best practices and to solve a tricky performance issue I'm facing.

Step 1: The Initial Problem (NORMAL_VRAM mode)
I started by checking my logs on both an A100 (80GB) and an L4 (24GB) worker. I noticed both were defaulting to NORMAL_VRAM mode, which seems suboptimal.

--- L4 ---
Total VRAM 22478 MB, total RAM 515498 MB
Set vram state to: NORMAL_VRAM
Device: cuda:0 NVIDIA L4 : cudaMallocAsync
ComfyUI version: 0.3.43

--- A100 ---
Total VRAM 81038 MB, total RAM 2051931 MB
Set vram state to: NORMAL_VRAM
Device: cuda:0 NVIDIA A100 80GB PCIe : cudaMallocAsync
ComfyUI version: 0.3.43

Step 2: The Attempted Fix and the New Issues
My first action was to add the --highvram flag to my launch command. This worked, and the logs now correctly show "Set vram state to: HIGH_VRAM". However, this is where I'm stuck and need your help. Despite being in HIGH_VRAM mode, performance is still poor, and new issues have appeared:
- CPU usage is constantly stuck at 100%.
- On the smaller 24GB L4, my workflow now fails with an OOM (out of memory) error from the UNETLoader.
- The GPU (A100 or L4) sits at only 30-40% utilization for the entire job.
This makes me suspect that other launch arguments I'm using (like --gpu-only) might be conflicting and preventing the workload from being properly handled on the GPU.

Let's Turn This Into a Knowledge-Sharing Thread!
To help solve this and create a resource for everyone, would you be willing to share the launch settings you use to run ComfyUI effectively? I'm especially interested in:
- Your full launch command from your start.sh or worker file.
- The type of GPU you're running on.
- Any key flags you've found essential for good performance (e.g., --preview-method auto, --disable-xformers, etc.).
- Any other "secret sauce" for reducing cold-start times or speeding up inference.

Thanks for sharing your expertise!
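For concreteness, here is a minimal start.sh sketch of the kind of launch command the thread is about. The flag names (--highvram, --gpu-only, --listen, --port) are standard ComfyUI CLI arguments, but the install path and the VRAM threshold are assumptions, and the whole thing is a starting point to test rather than a known-good config.

```bash
#!/usr/bin/env bash
# Minimal start.sh sketch in the spirit of the question above. Key idea: drop --gpu-only
# (it forces text encoders and everything else onto the GPU, a likely cause of the L4 OOM)
# and only keep models fully resident with --highvram on cards that have room for it.

# The install path /comfyui is an assumption -- adjust to your image layout.
cd /comfyui

# Total VRAM in MB, as reported by the driver (e.g. 22478 on an L4, 81038 on an A100 80GB).
VRAM_MB=$(nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits | head -n1)

EXTRA_ARGS=""
# Only pin everything in VRAM on large cards; let the 24GB L4 fall back to ComfyUI's
# default memory management so the UNet can be offloaded when needed.
if [ "${VRAM_MB:-0}" -ge 48000 ]; then
  EXTRA_ARGS="--highvram"
fi

python main.py --listen 0.0.0.0 --port 8188 $EXTRA_ARGS
```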
11 Replies
pave7946 (OP) · 2mo ago
Here is a text file that lists the different arguments for starting ComfyUI.
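(The attachment didn't come through here, but the same list can be regenerated from ComfyUI itself. This assumes a standard ComfyUI checkout at the path shown.)

```bash
# Print every launch argument ComfyUI's argument parser accepts, with descriptions.
cd /comfyui   # assumed install path
python main.py --help
```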
Morganja · 2mo ago
ComfyUI ... is slow. If you want fast, write your own pipeline worker.
pave7946 (OP) · 2mo ago
Super simple workflow with:
- flux1-dev-fp8.safetensors
- flux1-dev-vae.safetensors
- clip_l.safetensors
- t5xxl_fp16.safetensors

Max Workers: 1
Active Workers (Up to ~30% off): 0
GPU Count: 1
Idle Timeout: 5s
Execution Timeout: 600s
Enable Flashboot: False

Delay Time: 7.46s
Execution Time: 2m 9s
Morganja · 2mo ago
Yeah, I had about the same ... 90 sec rendering time. With my own pipeline I average 20 sec on an A40.
pave7946 (OP) · 2mo ago
I'm also attaching my start.sh file with the arguments I use to run ComfyUI.
pave7946 (OP) · 2mo ago
Don't you think 2m 9s is a long time compared to your 90s or 20s? Thanks for the link. I'm reading the part about accelerating inference and reducing memory use, and I have to say it's very interesting and well written.
Morganja · 2mo ago
Yeah, not to mention expensive as hell! There's no easy answer though. I got annoyed using ComfyUI, so I built a better solution. But there are a few flags that help ... there's a -- argument for not loading the model on startup, which helps with the UI. Aside from that, use accelerate, and try to split your LoRAs off the main GPU (allocate 2); there's a node pack called MultiGPU that will help. But in general ComfyUI is buggy, heavy, UI-oriented software. CLI mode works, but IYKYK. Trust me, if you want to run at any kind of volume, there's better out there unless you can afford dedicated hardware. I think someone has even written a Flux wrapper for InvokeAI, if all else fails.
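For anyone trying the MultiGPU suggestion: custom node packs are installed by cloning them into ComfyUI's custom_nodes folder. A hedged sketch follows; the repository URL and install path are assumptions, so check the pack's own project page before relying on them.

```bash
# Sketch: installing a custom node pack (here, the MultiGPU pack mentioned above).
# The repo URL and the /comfyui path are assumptions -- verify both for your setup.
cd /comfyui/custom_nodes
git clone https://github.com/pollockjj/ComfyUI-MultiGPU.git
# Some node packs ship extra Python dependencies:
[ -f ComfyUI-MultiGPU/requirements.txt ] && pip install -r ComfyUI-MultiGPU/requirements.txt
```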
gokuvonlange · 2mo ago
A dedicated workflow pipeline written in Python will always be superior. Personally, I still like ComfyUI because of its flexibility and because it's always the first to support new methods via nodes to play around with. I run very complex workflows with masking, object detection, and several passes. Sure, I could strip out all the code from the custom_nodes and make my own pipeline, but I like having the flexibility to change my JSON workflow without needing to rebuild my image.

I've made great strides in speeding up ComfyUI for my use case. I'm primarily focused on Wan 2.1, and a single img2vid job on an H100 at 1280x720 x 5 seconds takes me about ~2 minutes. That's roughly 0.14 cents per video, including interpolation to 30 fps. I use optimizations like Torch Compile, Sage Attention (Triton), cuBLAS, FP16 accumulation, and LoRAs that only need 8 sampling steps. Try implementing Torch Compile and Sage Attention first; those are absolute musts. Without these optimizations the same job can take the worker ~3.5 minutes.

Also make sure you package your models in the Docker image and don't use a network volume. The mounted network storage introduces I/O lag, which will greatly increase your cold-start and FlashBoot times.
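To make the attention side of that concrete, here's a hedged sketch. The --use-sage-attention and --fast flags exist in recent ComfyUI 0.3.x builds, but confirm them against your version (python main.py --help). Torch Compile is usually added through a compile node inside the workflow rather than a launch flag, so it isn't shown here.

```bash
# Hedged sketch: enabling SageAttention and the speed-oriented precision paths at launch.
# SageAttention needs Triton available in the environment.
pip install triton sageattention

# --use-sage-attention : route attention through the SageAttention kernel
# --fast               : enable faster code paths such as FP16 accumulation
python main.py --listen 0.0.0.0 --port 8188 --use-sage-attention --fast
```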
ridg3wood · 2mo ago
Anyone have thoughts on network storage vs keeping it all in the docker container? Any other ideas for optimization of serverless in general?
gokuvonlange · 2mo ago
I package everything in the Docker container. I've tried keeping it on a network volume, but the I/O latency is too much; every job had a cold start of a minute to a minute and a half just loading models. The moment I started packaging all my models into the image is when I saw a major speed gain.
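A hedged sketch of what baking models into the image looks like in practice: something like the following runs in a Dockerfile RUN step (or a build script) so the weights end up in an image layer instead of on a network volume. The /comfyui paths are assumptions and the URLs are placeholders for wherever you host your weights; the filenames just mirror the Flux files mentioned earlier in the thread.

```bash
# Download model weights at image build time so workers never pull them at runtime.
# Paths assume ComfyUI lives at /comfyui; the example.com URLs are placeholders.
MODELS=/comfyui/models
mkdir -p "$MODELS/checkpoints" "$MODELS/vae" "$MODELS/clip"

wget -q -O "$MODELS/checkpoints/flux1-dev-fp8.safetensors" "https://example.com/flux1-dev-fp8.safetensors"
wget -q -O "$MODELS/vae/flux1-dev-vae.safetensors"         "https://example.com/flux1-dev-vae.safetensors"
wget -q -O "$MODELS/clip/clip_l.safetensors"               "https://example.com/clip_l.safetensors"
wget -q -O "$MODELS/clip/t5xxl_fp16.safetensors"           "https://example.com/t5xxl_fp16.safetensors"
```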
