High-Fidelity and High-Efficiency Talking Portrait Synthesis With Detail-Aware Neural Radiance Fields

Muyu Wang; Sanyuan Zhao; Xingping Dong; Jianbing Shen

doi:10.1109/TVCG.2024.3488960

High-Fidelity and High-Efficiency Talking Portrait Synthesis With Detail-Aware Neural Radiance Fields

Muyu Wang, Sanyuan Zhao, Xingping Dong^*, Jianbing Shen

^*Corresponding author for this work

School of Computer Science and Technology

Research output: Contribution to journal › Article › peer-review

Abstract

In this paper, we propose a novel rendering framework based on neural radiance fields (NeRF) named HH-NeRF that can generate high-resolution audio-driven talking portrait videos with high fidelity and fast rendering. Specifically, our framework includes a detail-aware NeRF module and an efficient conditional super-resolution module. Firstly, a detail-aware NeRF is proposed to efficiently generate a high-fidelity low-resolution talking head, by using the encoded volume density estimation and audio-eye-aware color calculation. This module can capture natural eye blinks and high-frequency details, and maintain a similar rendering time as previous fast methods. Secondly, we present an efficient conditional super-resolution module on the dynamic scene to directly generate the high-resolution portrait with our low-resolution head. Incorporated with the prior information, such as depth map and audio features, our new proposed efficient conditional super resolution module can adopt a lightweight network to efficiently generate realistic and distinct high-resolution videos. Extensive experiments demonstrate that our method can generate more distinct and fidelity talking portraits on high resolution (900 × 900) videos compared to state-of-the-art methods. Our code is available at https://github.com/muyuWang/HHNeRF.

Original language	English
Journal	IEEE Transactions on Visualization and Computer Graphics
DOIs	https://doi.org/10.1109/TVCG.2024.3488960
Publication status	Accepted/In press - 2024

Keywords

Audio
neural radiance fields
super-resolution
talking portrait

Access to Document

10.1109/TVCG.2024.3488960

Cite this

@article{ba75fb6c58e34ca9bf39bbf677d638b4,

title = "High-Fidelity and High-Efficiency Talking Portrait Synthesis With Detail-Aware Neural Radiance Fields",

abstract = "In this paper, we propose a novel rendering framework based on neural radiance fields (NeRF) named HH-NeRF that can generate high-resolution audio-driven talking portrait videos with high fidelity and fast rendering. Specifically, our framework includes a detail-aware NeRF module and an efficient conditional super-resolution module. Firstly, a detail-aware NeRF is proposed to efficiently generate a high-fidelity low-resolution talking head, by using the encoded volume density estimation and audio-eye-aware color calculation. This module can capture natural eye blinks and high-frequency details, and maintain a similar rendering time as previous fast methods. Secondly, we present an efficient conditional super-resolution module on the dynamic scene to directly generate the high-resolution portrait with our low-resolution head. Incorporated with the prior information, such as depth map and audio features, our new proposed efficient conditional super resolution module can adopt a lightweight network to efficiently generate realistic and distinct high-resolution videos. Extensive experiments demonstrate that our method can generate more distinct and fidelity talking portraits on high resolution (900 × 900) videos compared to state-of-the-art methods. Our code is available at https://github.com/muyuWang/HHNeRF.",

keywords = "Audio, neural radiance fields, super-resolution, talking portrait",

author = "Muyu Wang and Sanyuan Zhao and Xingping Dong and Jianbing Shen",

note = "Publisher Copyright: {\textcopyright} 1995-2012 IEEE.",

year = "2024",

doi = "10.1109/TVCG.2024.3488960",

language = "English",

journal = "IEEE Transactions on Visualization and Computer Graphics",

issn = "1077-2626",

publisher = "IEEE Computer Society",

}

TY - JOUR

T1 - High-Fidelity and High-Efficiency Talking Portrait Synthesis With Detail-Aware Neural Radiance Fields

AU - Wang, Muyu

AU - Zhao, Sanyuan

AU - Dong, Xingping

AU - Shen, Jianbing

PY - 2024

Y1 - 2024

N2 - In this paper, we propose a novel rendering framework based on neural radiance fields (NeRF) named HH-NeRF that can generate high-resolution audio-driven talking portrait videos with high fidelity and fast rendering. Specifically, our framework includes a detail-aware NeRF module and an efficient conditional super-resolution module. Firstly, a detail-aware NeRF is proposed to efficiently generate a high-fidelity low-resolution talking head, by using the encoded volume density estimation and audio-eye-aware color calculation. This module can capture natural eye blinks and high-frequency details, and maintain a similar rendering time as previous fast methods. Secondly, we present an efficient conditional super-resolution module on the dynamic scene to directly generate the high-resolution portrait with our low-resolution head. Incorporated with the prior information, such as depth map and audio features, our new proposed efficient conditional super resolution module can adopt a lightweight network to efficiently generate realistic and distinct high-resolution videos. Extensive experiments demonstrate that our method can generate more distinct and fidelity talking portraits on high resolution (900 × 900) videos compared to state-of-the-art methods. Our code is available at https://github.com/muyuWang/HHNeRF.

AB - In this paper, we propose a novel rendering framework based on neural radiance fields (NeRF) named HH-NeRF that can generate high-resolution audio-driven talking portrait videos with high fidelity and fast rendering. Specifically, our framework includes a detail-aware NeRF module and an efficient conditional super-resolution module. Firstly, a detail-aware NeRF is proposed to efficiently generate a high-fidelity low-resolution talking head, by using the encoded volume density estimation and audio-eye-aware color calculation. This module can capture natural eye blinks and high-frequency details, and maintain a similar rendering time as previous fast methods. Secondly, we present an efficient conditional super-resolution module on the dynamic scene to directly generate the high-resolution portrait with our low-resolution head. Incorporated with the prior information, such as depth map and audio features, our new proposed efficient conditional super resolution module can adopt a lightweight network to efficiently generate realistic and distinct high-resolution videos. Extensive experiments demonstrate that our method can generate more distinct and fidelity talking portraits on high resolution (900 × 900) videos compared to state-of-the-art methods. Our code is available at https://github.com/muyuWang/HHNeRF.

KW - Audio

KW - neural radiance fields

KW - super-resolution

KW - talking portrait

UR - http://www.scopus.com/inward/record.url?scp=85208531865&partnerID=8YFLogxK

U2 - 10.1109/TVCG.2024.3488960

DO - 10.1109/TVCG.2024.3488960

M3 - Article

AN - SCOPUS:85208531865

SN - 1077-2626

JO - IEEE Transactions on Visualization and Computer Graphics

JF - IEEE Transactions on Visualization and Computer Graphics

ER -

High-Fidelity and High-Efficiency Talking Portrait Synthesis With Detail-Aware Neural Radiance Fields

Abstract

Keywords

Access to Document

Other files and links

Fingerprint

Cite this