Generating Co-Speech Gestures for Virtual Agents from Multimodal Information Based on Transformer

Yue Yu, Jiande Shi

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

To generate co-speech gestures for virtual agents and strengthen the correlation between gestures and the input modalities, we propose a Transformer-based model that encodes four modalities of information (Audio Waveform, Mel-Spectrogram, Text, and Speaker IDs). For the Mel-Spectrogram modality, we design a Mel-Spectrogram encoder based on a pre-trained Swin Transformer to extract audio spectrum features hierarchically. For the Text modality, we use a Transformer encoder to extract text features aligned with the audio. We evaluate our model on the TED-Gesture dataset. Compared with state-of-the-art methods, we improve the mean absolute joint error by 2.33%, the mean acceleration difference by 15.01%, and the Fréchet gesture distance by 59.32%.
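
The abstract only outlines the architecture, so the following is a minimal PyTorch sketch of such a four-modality encoder for orientation. All module names, dimensions, and the fusion step are hypothetical assumptions rather than the authors' released code, and the decoder that actually generates gesture/pose sequences is omitted.

```python
# Minimal sketch (assumption: PyTorch/torchvision; module names and dimensions
# are hypothetical, not the authors' implementation). It illustrates the idea
# from the abstract: encode four modalities (audio waveform, Mel-spectrogram,
# text, speaker ID), using a Swin Transformer backbone for the Mel-spectrogram
# and a Transformer encoder for text, then fuse the per-modality features.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import swin_t


class GestureEncoderSketch(nn.Module):
    def __init__(self, vocab_size=8000, n_speakers=1500, d_model=256):
        super().__init__()
        # Audio waveform branch: simple 1-D conv downsampling (placeholder).
        self.wave_enc = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=15, stride=4, padding=7), nn.ReLU(),
            nn.Conv1d(64, d_model, kernel_size=15, stride=4, padding=7), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        # Mel-spectrogram branch: Swin Transformer backbone (pre-trained in the
        # paper; weights omitted here so the sketch stays self-contained).
        swin = swin_t(weights=None)
        swin.head = nn.Linear(swin.head.in_features, d_model)
        self.mel_enc = swin
        # Text branch: Transformer encoder over word embeddings.
        self.text_emb = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.text_enc = nn.TransformerEncoder(layer, num_layers=2)
        # Speaker identity branch: one learned embedding per speaker ID.
        self.spk_emb = nn.Embedding(n_speakers, d_model)
        # Fuse the four modality features into a single conditioning vector.
        self.fuse = nn.Linear(4 * d_model, d_model)

    def forward(self, waveform, mel, text_ids, speaker_id):
        # waveform: (B, 1, T_audio); mel: (B, 1, n_mels, T_frames)
        # text_ids: (B, T_text); speaker_id: (B,)
        f_wave = self.wave_enc(waveform).squeeze(-1)                  # (B, d)
        mel3 = F.interpolate(mel, size=(224, 224)).repeat(1, 3, 1, 1)
        f_mel = self.mel_enc(mel3)                                    # (B, d)
        f_text = self.text_enc(self.text_emb(text_ids)).mean(dim=1)  # (B, d)
        f_spk = self.spk_emb(speaker_id)                              # (B, d)
        return self.fuse(torch.cat([f_wave, f_mel, f_text, f_spk], dim=-1))


if __name__ == "__main__":
    model = GestureEncoderSketch()
    out = model(torch.randn(2, 1, 16000), torch.randn(2, 1, 80, 120),
                torch.randint(0, 8000, (2, 20)), torch.randint(0, 1500, (2,)))
    print(out.shape)  # torch.Size([2, 256])
```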

Original language: English
Title of host publication: Proceedings - 2023 IEEE Conference on Virtual Reality and 3D User Interfaces Abstracts and Workshops, VRW 2023
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 887-888
Number of pages: 2
ISBN (Electronic): 9798350348392
DOI
Publication status: Published - 2023
Event: 2023 IEEE Conference on Virtual Reality and 3D User Interfaces Abstracts and Workshops, VRW 2023 - Shanghai, China
Duration: 25 Mar 2023 - 29 Mar 2023

Publication series

Name: Proceedings - 2023 IEEE Conference on Virtual Reality and 3D User Interfaces Abstracts and Workshops, VRW 2023

Conference

Conference: 2023 IEEE Conference on Virtual Reality and 3D User Interfaces Abstracts and Workshops, VRW 2023
Country/Territory: China
City: Shanghai
Period: 25/03/23 - 29/03/23
