Abstract
Generating talking face videos from a static portrait image driven by a reference video remains challenging due to inconsistencies in motion performance and potential artifacts such as frame jitter or facial distortion. The key to solving these challenges lies in the learning of motion features, namely accurate representation and effective alignment. In this paper, we propose a method that significantly improves the quality of generated talking videos by enhancing motion feature learning. Our method comprises two novel techniques: temporal representation augmentation (TRA) and spatial alignment correction (SAC). TRA improves the accuracy of motion feature representation through data augmentation in the temporal dimension, while SAC reduces alignment errors by optimizing spatial consistency between head pose and facial motion. Extensive experiments demonstrate that our method accurately learns motion features from the reference video, resulting in natural and high-fidelity talking video generation. Compared to state-of-the-art methods, our method exhibits superior performance in terms of image quality, motion accuracy, and video smoothness. Project and dataset website: https://github.com/donge1024/TalkingFace_Motion
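The abstract describes TRA only as data augmentation applied along the temporal dimension of motion features; the sketch below is a minimal, hypothetical illustration of that general idea (random temporal cropping plus speed jitter via interpolation), not the paper's actual TRA module. The function name `temporal_augment` and all parameter values are assumptions introduced here for illustration.

```python
import numpy as np

def temporal_augment(motion_seq: np.ndarray, rng: np.random.Generator,
                     max_speed_jitter: float = 0.2, crop_ratio: float = 0.9) -> np.ndarray:
    """Toy temporal augmentation of a motion-feature sequence.

    motion_seq: array of shape (T, D) -- T frames of D-dim motion features.
    Returns an augmented sequence of shape (T', D).
    """
    T, D = motion_seq.shape

    # 1) Random temporal crop: keep a contiguous sub-window of the sequence.
    crop_len = max(2, int(T * crop_ratio))
    start = rng.integers(0, T - crop_len + 1)
    seq = motion_seq[start:start + crop_len]

    # 2) Speed jitter: resample the window at a slightly different rate via
    #    linear interpolation, simulating faster or slower motion.
    speed = 1.0 + rng.uniform(-max_speed_jitter, max_speed_jitter)
    new_len = max(2, int(round(len(seq) / speed)))
    src_t = np.linspace(0.0, len(seq) - 1, num=len(seq))
    dst_t = np.linspace(0.0, len(seq) - 1, num=new_len)
    out = np.stack([np.interp(dst_t, src_t, seq[:, d]) for d in range(D)], axis=1)
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    seq = rng.normal(size=(64, 16))   # e.g. 64 frames of 16-dim motion features
    aug = temporal_augment(seq, rng)
    print(seq.shape, "->", aug.shape)
```

Such augmentations expose the motion encoder to varied temporal dynamics of the same reference clip; the specific transforms used by TRA are detailed in the paper itself.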
| Original language | English |
| --- | --- |
| Journal | Visual Computer |
| DOIs | |
| Publication status | Accepted/In press - 2025 |
| Externally published | Yes |
Keywords
- Data augmentation
- Feature representation
- Spatial consistency
- Talking face