Highly Stylized Music Video

I’m planning a highly stylized music video and want to sanity-check my workflow before I dive back in. Here’s what I’m trying to achieve:

• The music is already fully recorded.

• I want to drive an avatar’s full body and facial performance using video of my real vocal, keyboard, and guitar performances.

• I want to generate multiple stylized versions of myself (alien, cyborg, etc.) using my trained Flux model. I trained it about a year ago, so I may need to retrain or fine-tune a newer model.

• I want to control a virtual choir, all singing the same part, using my own performance as the driver.

• All scenes should take place in alien or otherworldly environments.

• I want dynamic, cinematic camera movement throughout.


A year ago I paused this project because video generation was too limited. Now that ControlNet-style conditioning for video exists (e.g., LTX-2), I’m considering picking it back up—though getting these tools running is still a bit of a headache.

I’m a software engineer, so I’m totally comfortable with open-source tools — but I’m also willing to pay for a commercial solution (like Kling) if it meaningfully improves the workflow or final quality.

I’ve attached a short video clip to show the kind of result I was getting. (None of it is performance-driven.)


If you were in my position today—with these goals, this hardware, and a trained model of yourself—what would your end‑to‑end workflow look like?

What tools, steps, and pipeline would you use to pull off a project like this in 2026?