Hi community!
I'm starting this thread to gather our collective knowledge on optimizing ComfyUI on RunPod Serverless. My goal is for us to share best practices and solve a tricky performance issue I'm facing.
Step 1: The Initial Problem (NORMAL_VRAM mode)
I started by checking my logs on both an A100 (80GB) and an L4 (24GB) worker. I noticed both were defaulting to NORMAL_VRAM mode, which seems suboptimal.
--- L4 ---
Total VRAM 22478 MB, total RAM 515498 MB
Set vram state to: NORMAL_VRAM
Device: cuda:0 NVIDIA L4 : cudaMallocAsync
ComfyUI version: 0.3.43
--- A100 ---
Total VRAM 81038 MB, total RAM 2051931 MB
Set vram state to: NORMAL_VRAM
Device: cuda:0 NVIDIA A100 80GB PCIe : cudaMallocAsync
ComfyUI version: 0.3.43
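For context, at this stage the worker was launching ComfyUI with no memory-related flags at all. Simplified (the path and port below are placeholders for however your start.sh invokes it), the command was essentially:

# No VRAM flags: ComfyUI auto-detects the device and falls back to NORMAL_VRAM
python3 /comfyui/main.py --listen 0.0.0.0 --port 8188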
Step 2: The Attempted Fix and the New Issues
My first action was to add the --highvram flag to my launch command. This worked, and the logs now correctly show Set vram state to: HIGH_VRAM.
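In other words, the launch command now looks roughly like this (path and port are placeholders again; the --highvram flag is the only actual change):

# --highvram keeps models resident in GPU memory instead of offloading them after each use
python3 /comfyui/main.py --listen 0.0.0.0 --port 8188 --highvram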
However, this is where I'm stuck and need your help. Despite being in HIGH_VRAM mode, the performance is still poor, and new issues have appeared:
CPU usage is constantly pinned at 100%.
On the smaller 24GB L4, my workflow now fails with an OOM (Out Of Memory) error from the UNETLoader node.
GPU utilization (on both the A100 and the L4) sits at only 30-40% for the entire job (see the monitoring snippet after this list for how I'm measuring it).
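For anyone who wants to reproduce or sanity-check those utilization numbers, I'm just sampling nvidia-smi inside the worker once per second while a job runs:

# Print GPU utilization and memory usage every second during a job
nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total --format=csv -l 1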
These symptoms make me suspect that other launch arguments I'm using (like --gpu-only) might be conflicting with --highvram and preventing the workload from being handled on the GPU properly.
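To make the suspected conflict concrete: as I understand it, --gpu-only forces everything (including text encoders and VAE) to be stored and run on the GPU, which overlaps with what --highvram already does and removes ComfyUI's ability to offload anything when memory gets tight. These are the two variants I'm comparing (simplified commands, not a recommendation; which combination is actually correct is exactly what I'm asking):

# Current setup - suspected conflict between the two memory flags
python3 /comfyui/main.py --listen 0.0.0.0 --port 8188 --highvram --gpu-only

# Next test - drop --gpu-only and let ComfyUI decide what to offload
python3 /comfyui/main.py --listen 0.0.0.0 --port 8188 --highvram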
Let's Turn This Into a Knowledge-Sharing Thread!
To help solve this and create a resource for everyone, would you be willing to share the launch settings you use to run ComfyUI effectively?
I'm especially interested in (there's a fill-in template after this list to make replies easy to compare):
Your full launch command from your start.sh or worker file.
The type of GPU you're running on.
Any key flags you've found essential for good performance (e.g., --preview-method auto or --disable-xformers).
Any other "secret sauce" for reducing cold start times or speeding up inference.
Thanks for sharing your expertise!