TY - JOUR
T1 - Decoupled Two-Stage Talking Head Generation via Gaussian-Landmark-Based Neural Radiance Fields
AU - Ma, Boyao
AU - Cao, Yuanping
AU - Zhang, Lei
N1 - Publisher Copyright:
© 2024 Tsinghua University Press.
PY - 2025
Y1 - 2025
N2 - Talking head generation based on neural radiance fields (NeRF) has gained prominence, primarily owing to the implicit 3D representation that NeRF provides within neural networks. However, most NeRF-based methods entangle the entire audio-to-video conversion in a joint training process, leading to challenges such as inadequate lip synchronization, limited learning efficiency, large memory requirements, and a lack of editability. In response to these issues, this paper introduces a fully decoupled NeRF-based method for generating talking heads. The method separates audio-to-video conversion into two stages through the use of facial landmarks. Notably, a Transformer network is used to establish the cross-modal connection between audio and landmarks and to generate landmarks that conform to the distribution of the training data. We also explore formant features of the audio as additional conditions to guide landmark generation. These landmarks are then combined with Gaussian relative position coding to refine the sampling points on the rays, thereby constructing a dynamic NeRF conditioned on the landmarks and audio features for rendering the head. This decoupled setup enhances both the fidelity and flexibility of the audio-to-video mapping using two independent small-scale networks. Additionally, it supports generating the torso from a head-only image with an improved StyleUnet, further enhancing the realism of the generated talking head. Experimental results demonstrate that our method excels at producing lifelike talking heads and that the lightweight network models also exhibit superior speed and learning efficiency with lower memory requirements.
KW - Transformer
KW - audio-driven generation
KW - neural radiance fields (NeRF) rendering
KW - talking head
UR - https://www.scopus.com/pages/publications/105017844216
U2 - 10.26599/CVM.2025.9450482
DO - 10.26599/CVM.2025.9450482
M3 - Article
AN - SCOPUS:105017844216
SN - 2096-0433
VL - 11
SP - 799
EP - 816
JO - Computational Visual Media
JF - Computational Visual Media
IS - 4
ER -