TY - GEN
T1 - Talking Head Generation via Viewpoint and Lighting Simulation Based on Global Representation
AU - Dong, Biao
AU - Zhang, Lei
N1 - Publisher Copyright:
© 2025 ACM.
PY - 2025/10/27
Y1 - 2025/10/27
N2 - NeRF-based talking head generation has made great progress, but existing methods still fall short of high-quality detail fidelity, manifested mainly as detail loss and intermittent blur. We attribute this to the limited viewpoint and lighting coverage of the training video data, which prevents full modeling of the global depth and brightness information of spatial points. Specifically, a fixed viewpoint may fail to provide sufficient depth information for high-frequency details, leading to inaccurate volume density estimation and the loss of details such as hair. Furthermore, constant lighting often fails to adapt to the drastic brightness changes across consecutive video frames, resulting in color accumulation errors and blurring artifacts. To address these issues, we propose a novel talking head generation method that combines layered viewpoint simulation (LVS) and continuous lighting simulation (CLS). LVS simulates multiple viewpoints from the multi-scale features of each video frame to construct a global depth representation, which improves the accuracy of volume density estimation and enhances the depiction of details. CLS simulates multiple lighting conditions from the brightness changes across consecutive video frames to construct a global brightness representation, thereby alleviating color accumulation errors and eliminating blur. Extensive experiments demonstrate that our method significantly improves detail quality compared with state-of-the-art methods.
AB - NeRF-based talking head generation has made great progress, but existing methods still fall short of high-quality detail fidelity, manifested mainly as detail loss and intermittent blur. We attribute this to the limited viewpoint and lighting coverage of the training video data, which prevents full modeling of the global depth and brightness information of spatial points. Specifically, a fixed viewpoint may fail to provide sufficient depth information for high-frequency details, leading to inaccurate volume density estimation and the loss of details such as hair. Furthermore, constant lighting often fails to adapt to the drastic brightness changes across consecutive video frames, resulting in color accumulation errors and blurring artifacts. To address these issues, we propose a novel talking head generation method that combines layered viewpoint simulation (LVS) and continuous lighting simulation (CLS). LVS simulates multiple viewpoints from the multi-scale features of each video frame to construct a global depth representation, which improves the accuracy of volume density estimation and enhances the depiction of details. CLS simulates multiple lighting conditions from the brightness changes across consecutive video frames to construct a global brightness representation, thereby alleviating color accumulation errors and eliminating blur. Extensive experiments demonstrate that our method significantly improves detail quality compared with state-of-the-art methods.
KW - lighting
KW - multimodality
KW - neural radiance fields
KW - talking head
KW - viewpoint
UR - https://www.scopus.com/pages/publications/105024071282
U2 - 10.1145/3746027.3755503
DO - 10.1145/3746027.3755503
M3 - Conference contribution
AN - SCOPUS:105024071282
T3 - MM 2025 - Proceedings of the 33rd ACM International Conference on Multimedia, Co-Located with MM 2025
SP - 10258
EP - 10267
BT - MM 2025 - Proceedings of the 33rd ACM International Conference on Multimedia, Co-Located with MM 2025
PB - Association for Computing Machinery, Inc
T2 - 33rd ACM International Conference on Multimedia, MM 2025
Y2 - 27 October 2025 through 31 October 2025
ER -