Given only a few-minute-long video of a person speaking, together with its audio track, as training data and arbitrary text as the driving input, the authors aim to synthesize high-quality talking-portrait videos that match the input text. This task has broad application prospects in the digital human industry but has not been technically achiev...