98% Speed Optimization Achieved - Can We Go Further?
How It Works
Modified the handler to detect worker initialization and pre-load all models into GPU memory, so the load cost is paid once per worker instead of once per job.
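Roughly, the change looks like the sketch below. This is a minimal, hypothetical version: the model file names, the `handler(job)` signature, and the job input format are placeholders, not the actual pipeline.

```python
import torch

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# Module-level loads run exactly once, when the worker process starts,
# so every job after worker start skips the multi-second model-load step.
_MODELS = {
    "face": torch.jit.load("face_animation.pt", map_location=DEVICE).eval(),
    "lipsync": torch.jit.load("lip_sync.pt", map_location=DEVICE).eval(),
}

def handler(job: dict) -> dict:
    """Per-job entry point: models are already resident in GPU memory."""
    frames = torch.as_tensor(job["frames"], dtype=torch.float32, device=DEVICE)
    with torch.inference_mode():
        animated = _MODELS["face"](frames)
        synced = _MODELS["lipsync"](animated)
    return {"output": synced.cpu().tolist()}
```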
---
The Question
Is ~5-6s the floor for this type of workflow, or can we optimize further?
Potential areas to explore (rough sketches of a few of these follow the list):
- Batch processing multiple frames at once?
- Model quantization (FP8/INT8) without quality loss?
- Compiling models with torch.compile()?
- Pipeline parallelization (overlapping stages)?
- Faster storage backend for outputs?
- WebSocket streaming vs. polling?
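For batching, the sketch below is the kind of thing I have in mind: stack frames into one tensor to amortize kernel-launch and host-to-device transfer overhead. The `run_batched` helper, the batch size, and the single-model assumption are hypothetical.

```python
import torch

def run_batched(model: torch.nn.Module, frames: list[torch.Tensor],
                batch_size: int = 16) -> list[torch.Tensor]:
    """Run frames through the model in fixed-size batches instead of one by one.

    Assumes all frames share a shape and the model returns a batched tensor.
    """
    outputs: list[torch.Tensor] = []
    with torch.inference_mode():
        for i in range(0, len(frames), batch_size):
            batch = torch.stack(frames[i:i + batch_size]).cuda()
            outputs.extend(model(batch).cpu().unbind(0))  # split batch back out
    return outputs
```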
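On quantization, the quality-loss question is measurable rather than guessable. A sketch of that measurement, using FP16 as a stand-in since real FP8/INT8 paths depend on the specific models and hardware; `precision_gap` is a hypothetical helper:

```python
import copy
import torch

@torch.inference_mode()
def precision_gap(model: torch.nn.Module, sample: torch.Tensor) -> float:
    """Max absolute deviation of a half-precision copy vs. the FP32 original.

    Assumes model and sample already live on the same device.
    """
    reference = model(sample)
    approx = copy.deepcopy(model).half()(sample.half()).float()
    return (reference - approx).abs().max().item()
```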
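And for overlapping stages, a minimal thread-based sketch: stage B of chunk *i* runs while stage A of chunk *i+1* runs on the main thread. The overlap only materializes when each stage releases the GIL (GPU kernels and I/O do); `pipelined`, `stage_a`, and `stage_b` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Iterable

def pipelined(stage_a: Callable, stage_b: Callable, chunks: Iterable) -> list:
    """Overlap stage B of chunk i with stage A of chunk i+1."""
    results: list[Any] = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = None
        for chunk in chunks:
            a_out = stage_a(chunk)                 # runs on the main thread
            if pending is not None:
                results.append(pending.result())   # collect previous chunk
            pending = pool.submit(stage_b, a_out)  # overlaps the next stage_a
        if pending is not None:
            results.append(pending.result())
    return results
```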
2 Replies
For context: This is a full video generation pipeline with facial animation + lip sync, not just a single model inference.
Hey, going through my backlog here.
We are working on a faster storage backend, and I think compiling the models will help you, but it will hurt your first-job time, since compilation can take a minute or so.
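One way to keep that compile hit out of the first real job is to trigger compilation during worker init with a dummy inference, since `torch.compile` is lazy and only compiles on the first call. A minimal sketch; the stand-in model and the input shape are assumptions:

```python
import torch

# Stand-in for one pipeline stage; the real models are loaded at worker init.
model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 16, 3, padding=1), torch.nn.ReLU()
).cuda().eval()

# torch.compile defers the expensive compilation to the first call, so a
# dummy inference at init pays the one-time cost up front (slower cold
# start) instead of inside the first real job.
fast_model = torch.compile(model, mode="reduce-overhead")
with torch.inference_mode():
    fast_model(torch.randn(1, 3, 512, 512, device="cuda"))  # warm-up; shape is a guess
```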