98% Speed Optimization Achieved - Can We Go Further?

Current Setup & Results
Architecture:

- RunPod Serverless + ComfyUI
- InfiniteTalk I2V workflow (Image-to-Video with audio)
- Multiple large models: Wav2Vec2, LivePortrait, CogVideoX, etc.
- 80GB VRAM GPUs (A100/H100)
Performance Journey:

- V0.22 (baseline): Every job loaded models from scratch = ~297s per job
- V0.23 (optimized): Pre-warm models during worker init, keep in GPU memory
  - First job (cold start): ~322s (one-time worker init + pre-warming)
  - Subsequent jobs: 5-6 seconds (just inference)
  - 98.2% improvement for warm jobs (54x speedup)
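
As a quick sanity check on those headline numbers (assuming a warm-job time of roughly 5.5s, since the measured range is 5-6s):

```python
baseline = 297.0  # v0.22: seconds per job, models loaded from scratch
warm = 5.5        # v0.23: assumed warm-job time (measured 5-6 s)

print(f"speedup: {baseline / warm:.0f}x")           # ~54x
print(f"improvement: {1 - warm / baseline:.1%}")    # ~98.1%
```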
How It Works

Modified the handler to detect worker initialization and pre-load all models into GPU memory; a minimal sketch of the pattern is below.

The Question

Is ~5-6s the floor for this type of workflow, or can we optimize further? Potential areas to explore:

- Batch processing multiple frames at once?
- Model quantization (FP8/INT8) without quality loss?
- Compile models with torch.compile()?
- Pipeline parallelization (overlap stages)?
- Faster storage backend for outputs?
- WebSocket streaming vs polling?
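
For reference, this is the shape of the pre-warm handler, not the actual code: a minimal sketch assuming a RunPod serverless Python handler, where load_all_models() and run_inference() are hypothetical placeholders for the real ComfyUI/InfiniteTalk loading and inference logic.

```python
import runpod

MODELS = None  # model handles kept resident in GPU memory for the worker's lifetime


def load_all_models():
    # Hypothetical placeholder: load Wav2Vec2, LivePortrait, CogVideoX, etc.
    # onto the GPU once; this is the expensive step (minutes).
    return {}


def run_inference(models, job_input):
    # Hypothetical placeholder: run the ComfyUI workflow with already-loaded
    # models; on a warm worker only this part runs (~5-6s in the numbers above).
    return {"status": "ok"}


def handler(job):
    global MODELS
    if MODELS is None:  # safety net in case worker init was skipped
        MODELS = load_all_models()
    return run_inference(MODELS, job["input"])


# Worker init: module-level code runs once per worker, not once per job,
# so the model-loading cost is paid only on cold start.
MODELS = load_all_models()

runpod.serverless.start({"handler": handler})
```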
2 Replies
cypher (OP), 2w ago
For context: This is a full video generation pipeline with facial animation + lip sync, not just a single model inference.
Dj, 2w ago
Hey, going through my backlog here. We are working on a faster storage backend, and I think compiling models will help you, but it will hurt your first-job time, as compilation may need a minute or so.
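
For context, the torch.compile() trade-off looks roughly like this. A minimal sketch assuming PyTorch 2.x; video_model is a hypothetical stand-in for one of the pipeline's models, not the actual InfiniteTalk module:

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for one of the pipeline's models (e.g. the video
# generation stage); in the real handler this would be the pre-loaded module.
video_model = nn.Sequential(nn.Linear(64, 64), nn.GELU(), nn.Linear(64, 64)).cuda()

# Compile once during worker init. The first call triggers graph capture and
# kernel compilation (the "extra minute" on the first job); later calls reuse
# the compiled kernels, which is where warm-job latency can improve.
video_model = torch.compile(video_model, mode="reduce-overhead")

# Warm-up pass so the compile cost lands in worker init, not the first user job.
with torch.no_grad():
    video_model(torch.randn(1, 64, device="cuda"))
```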
