ManiTalk: manipulable talking head generation from single image in the wild

Hui Fang, Dongdong Weng*, Zeyu Tian, Yin Ma

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

Generating talking head videos from a face image and a piece of speech audio has gained widespread interest. Existing talking face synthesis methods typically lack the ability to generate manipulable facial details and pupils, which is desirable for producing stylized facial expressions. We present ManiTalk, the first manipulable audio-driven talking head generation system. Our system consists of three stages. In the first stage, the proposed Exp Generator and Pose Generator generate synchronized talking landmarks and presentation-style head poses. In the second stage, we parameterize the positions of eyebrows, eyelids, and pupils, enabling personalized and straightforward manipulation of facial details. In the last stage, we introduce SFWNet to warp facial images based on the landmark motions. Additional driving sketches can be provided as input to generate more precise expressions. Extensive quantitative and qualitative evaluations, along with user studies, demonstrate that the system can accurately manipulate facial details and achieve excellent lip synchronization. Our system achieves state-of-the-art performance in terms of identity preservation and video quality. Code is available at https://github.com/shanzhajuan/ManiTalk.
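As a rough illustration only, the three-stage pipeline summarized in the abstract might be wired together as in the following Python sketch. The function signatures, helper names, and data shapes are assumptions made for exposition and are not taken from the released code; see the repository linked above for the actual implementation.

```python
# Conceptual sketch of the three-stage ManiTalk pipeline described in the
# abstract. All interfaces below are hypothetical and shown only to clarify
# how the stages feed into one another.

from typing import Callable, List, Optional

import numpy as np


def run_manitalk_pipeline(
    source_image: np.ndarray,
    audio_features: np.ndarray,
    exp_generator: Callable[[np.ndarray], np.ndarray],        # audio -> per-frame talking landmarks
    pose_generator: Callable[[np.ndarray], np.ndarray],       # audio -> per-frame head poses
    sfw_net: Callable[[np.ndarray, np.ndarray], np.ndarray],  # (source image, driving sketch) -> frame
    render_sketch: Callable[[np.ndarray, np.ndarray], np.ndarray],  # (landmarks, pose) -> sketch
    edit_details: Optional[Callable[[np.ndarray], np.ndarray]] = None,  # eyebrow/eyelid/pupil edits
) -> List[np.ndarray]:
    # Stage 1: generate synchronized talking landmarks and head poses from audio.
    landmarks = exp_generator(audio_features)
    poses = pose_generator(audio_features)

    # Stage 2: optionally manipulate facial details (eyebrows, eyelids, pupils)
    # by applying explicit position parameters to the generated landmarks.
    if edit_details is not None:
        landmarks = edit_details(landmarks)

    # Stage 3: warp the source image frame by frame from the landmark motion,
    # conditioning the warping network on a driving sketch for finer control.
    return [
        sfw_net(source_image, render_sketch(lm, pose))
        for lm, pose in zip(landmarks, poses)
    ]
```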

Original language: English
Pages (from-to): 4913-4925
Number of pages: 13
Journal: Visual Computer
Volume: 40
Issue number: 7
DOIs
Publication status: Published - Jul 2024

Keywords

  • Expression manipulation
  • Facial animation
  • Gaze manipulation
  • Neural network
