Audio-to-Deep-Lip: Speaking lip synthesis based on 3D landmarks

Hui Fang, Dongdong Weng*, Zeyu Tian, Yin Ma, Xiangju Lu

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

2 Citations (Scopus)

Abstract

Generating talking lips in sync with input speech has the potential to enhance speech communication and enable novel applications. This paper presents a system that generates accurate 3D talking lips and is readily applicable to unseen subjects and different languages. The developed head-mounted facial acquisition device and automated data processing pipeline produce precise landmarks while mitigating the difficulty of acquiring 3D facial data. Our system generates accurate lip movements in three stages. In the first stage, a fine-tuned Wav2Vec2.0+Transformer captures long-range audio context dependencies. In the second stage, we propose a Viseme Fixing method that significantly improves lip accuracy on the /b/, /p/, /m/, and /f/ phonemes. In the last stage, we exploit the structural relationship between the inner and outer lips and learn a mapping from the outer lip landmarks to the inner lip landmarks. Subjective evaluations show that the generated talking lips match the input audio well. We demonstrate two applications that animate 2D face videos and 3D face models using our landmarks. The precise lip landmarks allow the generated animations to surpass the results of state-of-the-art methods.
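
The abstract describes a first stage in which a fine-tuned Wav2Vec2.0 encoder feeds a Transformer that predicts per-frame lip landmarks from speech. The following minimal Python sketch illustrates that kind of audio-to-landmark pipeline; the checkpoint name, landmark count, and layer sizes are illustrative assumptions, not the authors' configuration.

    # Sketch of an audio-to-3D-lip-landmark model in the spirit of the first stage:
    # Wav2Vec2.0 speech features -> Transformer context -> per-frame (x, y, z) landmarks.
    # All hyperparameters and the checkpoint name are assumptions for illustration.
    import torch
    import torch.nn as nn
    from transformers import Wav2Vec2Model

    class AudioToLipLandmarks(nn.Module):
        def __init__(self, num_landmarks=20, d_model=768, n_layers=4, n_heads=8):
            super().__init__()
            # Pretrained speech encoder; the paper fine-tunes Wav2Vec2.0, here we
            # simply load a public checkpoint and keep it trainable.
            self.wav2vec = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")
            encoder_layer = nn.TransformerEncoderLayer(
                d_model=d_model, nhead=n_heads, batch_first=True
            )
            # Transformer capturing long-range audio context dependencies.
            self.context = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
            # Regression head: one (x, y, z) triple per lip landmark per frame.
            self.head = nn.Linear(d_model, num_landmarks * 3)

        def forward(self, waveform):
            # waveform: (batch, samples) of 16 kHz audio
            feats = self.wav2vec(waveform).last_hidden_state  # (batch, frames, 768)
            ctx = self.context(feats)                         # (batch, frames, 768)
            out = self.head(ctx)                              # (batch, frames, K*3)
            return out.view(out.size(0), out.size(1), -1, 3)  # (batch, frames, K, 3)

    # Usage sketch: one second of dummy audio -> per-frame 3D lip landmarks.
    if __name__ == "__main__":
        model = AudioToLipLandmarks()
        audio = torch.randn(1, 16000)
        landmarks = model(audio)
        print(landmarks.shape)  # e.g. torch.Size([1, 49, 20, 3])

The later stages described in the abstract (Viseme Fixing for /b/ /p/ /m/ /f/ and the outer-to-inner lip mapping) would operate on the landmark sequence produced by a model of this shape.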

Original language: English
Article number: 103925
Journal: Computers and Graphics (Pergamon)
Volume: 120
DOIs
Publication status: Published - May 2024

Keywords

  • 3D talking meshes
  • Landmarks
  • Lip animation
  • Viseme
