Generating Co-Speech Gestures for Virtual Agents from Multimodal Information Based on Transformer

Yue Yu; Jiande Shi

doi:10.1109/VRW58643.2023.00286

School of Computer Science and Technology

Beijing Institute of Technology

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

To generate co-speech gestures for virtual agents and enhance the correlation between gestures and input modalities, we propose a Transformer-based model that encodes four modality-like inputs (Audio Waveform, Mel-Spectrogram, Text, and Speaker IDs). For the Mel-Spectrogram modality, we design a Mel-Spectrogram encoder based on a pre-trained Swin Transformer model to extract audio spectrum features hierarchically. For the Text modality, we use a Transformer encoder to extract text features aligned with the audio. We evaluate the model on the TED-Gesture dataset. Compared with state-of-the-art methods, we improve the mean absolute joint error by 2.33%, the mean acceleration difference by 15.01%, and the Fréchet gesture distance by 59.32%.

Original language	English
Title of host publication	Proceedings - 2023 IEEE Conference on Virtual Reality and 3D User Interfaces Abstracts and Workshops, VRW 2023
Publisher	Institute of Electrical and Electronics Engineers Inc.
Pages	887-888
Number of pages	2
ISBN (Electronic)	9798350348392
DOIs	https://doi.org/10.1109/VRW58643.2023.00286
Publication status	Published - 2023
Event	2023 IEEE Conference on Virtual Reality and 3D User Interfaces Abstracts and Workshops, VRW 2023 - Shanghai, China
Duration	25 Mar 2023 → 29 Mar 2023

Publication series

Name	Proceedings - 2023 IEEE Conference on Virtual Reality and 3D User Interfaces Abstracts and Workshops, VRW 2023

Conference

Conference	2023 IEEE Conference on Virtual Reality and 3D User Interfaces Abstracts and Workshops, VRW 2023
Country/Territory	China
City	Shanghai
Period	25/03/23 → 29/03/23

Keywords

Computer systems organization
Computing methodologies-Co-Speech Gestures
Computing methodologies-Virtual Agents
Transformer

Access to Document

10.1109/VRW58643.2023.00286

Cite this

Yu, Y., & Shi, J. (2023). Generating Co-Speech Gestures for Virtual Agents from Multimodal Information Based on Transformer. In Proceedings - 2023 IEEE Conference on Virtual Reality and 3D User Interfaces Abstracts and Workshops, VRW 2023 (pp. 887-888). (Proceedings - 2023 IEEE Conference on Virtual Reality and 3D User Interfaces Abstracts and Workshops, VRW 2023). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/VRW58643.2023.00286

Yu, Yue ; Shi, Jiande. / Generating Co-Speech Gestures for Virtual Agents from Multimodal Information Based on Transformer. Proceedings - 2023 IEEE Conference on Virtual Reality and 3D User Interfaces Abstracts and Workshops, VRW 2023. Institute of Electrical and Electronics Engineers Inc., 2023. pp. 887-888 (Proceedings - 2023 IEEE Conference on Virtual Reality and 3D User Interfaces Abstracts and Workshops, VRW 2023).

@inproceedings{a383ee2d70de4ff69b6ca852a6e2f71a,
title = "Generating Co-Speech Gestures for Virtual Agents from Multimodal Information Based on Transformer",
abstract = "To generate co-speech gestures for virtual agents and enhance the correlation between gestures and input modalities, we propose a Transformer-based model that encodes four modality-like inputs (Audio Waveform, Mel-Spectrogram, Text, and Speaker IDs). For the Mel-Spectrogram modality, we design a Mel-Spectrogram encoder based on a pre-trained Swin Transformer model to extract audio spectrum features hierarchically. For the Text modality, we use a Transformer encoder to extract text features aligned with the audio. We evaluate the model on the TED-Gesture dataset. Compared with state-of-the-art methods, we improve the mean absolute joint error by 2.33\%, the mean acceleration difference by 15.01\%, and the Fr{\'e}chet gesture distance by 59.32\%.",
keywords = "Computer systems organization, Computing methodologies-Co-Speech Gestures, Computing methodologies-Virtual Agents, Transformer",
author = "Yue Yu and Jiande Shi",
note = "Publisher Copyright: {\textcopyright} 2023 IEEE.; 2023 IEEE Conference on Virtual Reality and 3D User Interfaces Abstracts and Workshops, VRW 2023 ; Conference date: 25-03-2023 Through 29-03-2023",
year = "2023",
doi = "10.1109/VRW58643.2023.00286",
language = "English",
series = "Proceedings - 2023 IEEE Conference on Virtual Reality and 3D User Interfaces Abstracts and Workshops, VRW 2023",
publisher = "Institute of Electrical and Electronics Engineers Inc.",
pages = "887--888",
booktitle = "Proceedings - 2023 IEEE Conference on Virtual Reality and 3D User Interfaces Abstracts and Workshops, VRW 2023",
address = "United States",
}

Yu, Y & Shi, J 2023, Generating Co-Speech Gestures for Virtual Agents from Multimodal Information Based on Transformer. in Proceedings - 2023 IEEE Conference on Virtual Reality and 3D User Interfaces Abstracts and Workshops, VRW 2023. Proceedings - 2023 IEEE Conference on Virtual Reality and 3D User Interfaces Abstracts and Workshops, VRW 2023, Institute of Electrical and Electronics Engineers Inc., pp. 887-888, 2023 IEEE Conference on Virtual Reality and 3D User Interfaces Abstracts and Workshops, VRW 2023, Shanghai, China, 25/03/23. https://doi.org/10.1109/VRW58643.2023.00286

Generating Co-Speech Gestures for Virtual Agents from Multimodal Information Based on Transformer. / Yu, Yue; Shi, Jiande.
Proceedings - 2023 IEEE Conference on Virtual Reality and 3D User Interfaces Abstracts and Workshops, VRW 2023. Institute of Electrical and Electronics Engineers Inc., 2023. p. 887-888 (Proceedings - 2023 IEEE Conference on Virtual Reality and 3D User Interfaces Abstracts and Workshops, VRW 2023).

TY  - GEN
T1  - Generating Co-Speech Gestures for Virtual Agents from Multimodal Information Based on Transformer
AU  - Yu, Yue
AU  - Shi, Jiande
PY  - 2023
Y1  - 2023
N2  - To generate co-speech gestures for virtual agents and enhance the correlation between gestures and input modalities, we propose a Transformer-based model that encodes four modality-like inputs (Audio Waveform, Mel-Spectrogram, Text, and Speaker IDs). For the Mel-Spectrogram modality, we design a Mel-Spectrogram encoder based on a pre-trained Swin Transformer model to extract audio spectrum features hierarchically. For the Text modality, we use a Transformer encoder to extract text features aligned with the audio. We evaluate the model on the TED-Gesture dataset. Compared with state-of-the-art methods, we improve the mean absolute joint error by 2.33%, the mean acceleration difference by 15.01%, and the Fréchet gesture distance by 59.32%.
AB  - To generate co-speech gestures for virtual agents and enhance the correlation between gestures and input modalities, we propose a Transformer-based model that encodes four modality-like inputs (Audio Waveform, Mel-Spectrogram, Text, and Speaker IDs). For the Mel-Spectrogram modality, we design a Mel-Spectrogram encoder based on a pre-trained Swin Transformer model to extract audio spectrum features hierarchically. For the Text modality, we use a Transformer encoder to extract text features aligned with the audio. We evaluate the model on the TED-Gesture dataset. Compared with state-of-the-art methods, we improve the mean absolute joint error by 2.33%, the mean acceleration difference by 15.01%, and the Fréchet gesture distance by 59.32%.
KW  - Computer systems organization
KW  - Computing methodologies-Co-Speech Gestures
KW  - Computing methodologies-Virtual Agents
KW  - Transformer
UR  - http://www.scopus.com/inward/record.url?scp=85159687023&partnerID=8YFLogxK
U2  - 10.1109/VRW58643.2023.00286
DO  - 10.1109/VRW58643.2023.00286
M3  - Conference contribution
AN  - SCOPUS:85159687023
T3  - Proceedings - 2023 IEEE Conference on Virtual Reality and 3D User Interfaces Abstracts and Workshops, VRW 2023
SP  - 887
EP  - 888
BT  - Proceedings - 2023 IEEE Conference on Virtual Reality and 3D User Interfaces Abstracts and Workshops, VRW 2023
PB  - Institute of Electrical and Electronics Engineers Inc.
T2  - 2023 IEEE Conference on Virtual Reality and 3D User Interfaces Abstracts and Workshops, VRW 2023
Y2  - 25 March 2023 through 29 March 2023
ER  - 

Yu Y, Shi J. Generating Co-Speech Gestures for Virtual Agents from Multimodal Information Based on Transformer. In Proceedings - 2023 IEEE Conference on Virtual Reality and 3D User Interfaces Abstracts and Workshops, VRW 2023. Institute of Electrical and Electronics Engineers Inc. 2023. p. 887-888. (Proceedings - 2023 IEEE Conference on Virtual Reality and 3D User Interfaces Abstracts and Workshops, VRW 2023). doi: 10.1109/VRW58643.2023.00286
