Enhanced Temporal Representation and Spatial Alignment for High-Fidelity Talking Video Generation

Biao Dong, Lei Zhang*

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

Generating talking face videos from a static portrait image and a reference video remains challenging due to inconsistencies in motion performance and potential artifacts such as frame jitter or facial distortion. The key to addressing these challenges lies in learning motion features, including accurate representation and effective alignment. In this paper, we propose a method that significantly improves the quality of generated talking videos by enhancing motion feature learning. Our method comprises two novel techniques: temporal representation augmentation (TRA) and spatial alignment correction (SAC). TRA improves the accuracy of motion feature representation through data augmentation in the temporal dimension, while SAC reduces alignment losses by optimizing spatial consistency between head pose and facial motion. Extensive experiments demonstrate that our method accurately learns motion features from the reference video, resulting in natural and high-fidelity talking video generation. Compared with state-of-the-art methods, our method achieves superior performance in image quality, motion accuracy, and video smoothness. Project and dataset website: https://github.com/donge1024/TalkingFace_Motion
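For illustration only, temporal representation augmentation can be read as perturbing a motion feature sequence along the time axis before it is used for representation learning. The sketch below is a minimal, hypothetical example of such temporal augmentation; the function name augment_temporal, the use of NumPy, and the specific operations (temporal cropping, frame dropping, speed resampling) are assumptions for illustration, not the authors' implementation.

    # Hypothetical sketch of temporal augmentation for a motion feature
    # sequence of shape (T frames, D features). The specific operations are
    # common choices and are assumptions, not the paper's exact method.
    import numpy as np

    def augment_temporal(motion_seq: np.ndarray, rng: np.random.Generator) -> np.ndarray:
        """Apply one random temporal perturbation to a (T, D) motion sequence."""
        T, _ = motion_seq.shape
        choice = rng.integers(0, 3)
        if choice == 0:
            # Random temporal crop: keep a contiguous sub-clip.
            keep = max(2, int(T * rng.uniform(0.7, 1.0)))
            start = rng.integers(0, T - keep + 1)
            return motion_seq[start:start + keep]
        if choice == 1:
            # Random frame dropping: remove ~10% of frames, keep the endpoints.
            mask = rng.random(T) > 0.1
            mask[0] = mask[-1] = True
            return motion_seq[mask]
        # Speed resampling: linearly interpolate the sequence to a new length.
        new_T = max(2, int(T * rng.uniform(0.8, 1.25)))
        old_idx = np.linspace(0.0, T - 1, num=new_T)
        lo = np.floor(old_idx).astype(int)
        hi = np.minimum(lo + 1, T - 1)
        w = (old_idx - lo)[:, None]
        return (1.0 - w) * motion_seq[lo] + w * motion_seq[hi]

    # Example usage on a dummy sequence of 50 frames with 64-D motion features.
    rng = np.random.default_rng(0)
    seq = rng.standard_normal((50, 64))
    aug = augment_temporal(seq, rng)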

Original language: English
Journal: Visual Computer
DOIs
Publication status: Accepted/In press - 2025
Externally published: Yes

Keywords

  • Data augmentation
  • Feature representation
  • Spatial consistency
  • Talking face
