TY - GEN
T1 - ListenFormer: Responsive Listening Head Generation with Non-autoregressive Transformers
T2 - 32nd ACM International Conference on Multimedia, MM 2024
AU - Liu, Miao
AU - Wang, Jing
AU - Qian, Xinyuan
AU - Li, Haizhou
N1 - Publisher Copyright:
© 2024 ACM.
PY - 2024/10/28
Y1 - 2024/10/28
AB - As one of the crucial elements in human-robot interaction, responsive listening head generation has attracted considerable attention from researchers. It aims to generate a listening head video from the speaker's audio and video together with a reference listener image. However, existing methods exhibit two limitations: 1) their generation capability is limited, so the generated videos remain far from real ones, and 2) they mostly employ autoregressive generative models, which cannot mitigate the risk of error accumulation. To tackle these issues, we propose ListenFormer, which leverages the powerful temporal modeling capability of transformers for generation. With the proposed two-stage training method, it performs non-autoregressive prediction, simultaneously achieving temporal continuity and overall consistency in the outputs. To fully utilize the information from the speaker inputs, we design an audio-motion attention fusion module, which improves the correlation of audio and motion features for accurate responses. Additionally, a novel decoding method, a sliding window with a large shift, is proposed for ListenFormer, demonstrating both excellent computational efficiency and effectiveness. Extensive experiments show that ListenFormer outperforms existing state-of-the-art methods on the ViCo and L2L datasets, and a perceptual user study demonstrates the comprehensive performance of our method in generation diversity, identity preservation, speaker-listener synchronization, and attitude matching. Our code is available at https://liushenme.github.io/ListenFormer.github.io/.
KW - listening head generation
KW - transformer
KW - video synthesis
UR - http://www.scopus.com/inward/record.url?scp=85209783402&partnerID=8YFLogxK
U2 - 10.1145/3664647.3681182
DO - 10.1145/3664647.3681182
M3 - Conference contribution
AN - SCOPUS:85209783402
T3 - MM 2024 - Proceedings of the 32nd ACM International Conference on Multimedia
SP - 7094
EP - 7103
BT - MM 2024 - Proceedings of the 32nd ACM International Conference on Multimedia
PB - Association for Computing Machinery, Inc
Y2 - 28 October 2024 through 1 November 2024
ER -