Generating Co-Speech Gestures for Virtual Agents from Multimodal Information Based on Transformer

Yue Yu, Jiande Shi

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

To generate co-speech gestures for virtual agents and strengthen the correlation between gestures and input modalities, we propose a Transformer-based model that encodes four modality-like inputs (audio waveform, Mel-spectrogram, text, and speaker IDs). For the Mel-spectrogram modality, we design an encoder based on a pre-trained Swin Transformer to extract audio spectrum features hierarchically. For the text modality, we use a Transformer encoder to extract text features aligned with the audio. We evaluate the model on the TED-Gesture dataset. Compared with state-of-the-art methods, it improves the mean absolute joint error by 2.33%, the mean acceleration difference by 15.01%, and the Fréchet gesture distance by 59.32%.
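As a rough illustration of two of the metrics named in the abstract, the sketch below computes a mean absolute joint error and a mean acceleration difference on nested lists of joint coordinates. This record does not define the metrics, so the formulas (mean absolute difference of positions, and of second-order finite differences over time) are assumptions based on how these metrics are commonly computed, and all function names are hypothetical.

```python
# Illustrative sketch only. The formulas and names below are assumptions,
# not the authors' implementation.
# A pose sequence is a list of frames; a frame is a list of joints;
# a joint is a list of coordinates.


def mean_absolute_joint_error(pred, target):
    """Mean absolute difference over all frames, joints, and coordinates."""
    total, count = 0.0, 0
    for frame_p, frame_t in zip(pred, target):
        for joint_p, joint_t in zip(frame_p, frame_t):
            for p, t in zip(joint_p, joint_t):
                total += abs(p - t)
                count += 1
    return total / count


def acceleration(seq):
    """Second-order finite difference along the time axis."""
    return [
        [[c2 - 2 * c1 + c0 for c0, c1, c2 in zip(j0, j1, j2)]
         for j0, j1, j2 in zip(f0, f1, f2)]
        for f0, f1, f2 in zip(seq, seq[1:], seq[2:])
    ]


def mean_acceleration_difference(pred, target):
    """Mean absolute difference between the two acceleration sequences."""
    return mean_absolute_joint_error(acceleration(pred), acceleration(target))
```

Both functions expect sequences of equal shape, and the acceleration-based metric needs at least three frames.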

Original language: English
Title of host publication: Proceedings - 2023 IEEE Conference on Virtual Reality and 3D User Interfaces Abstracts and Workshops, VRW 2023
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 887-888
Number of pages: 2
ISBN (Electronic): 9798350348392
Publication status: Published - 2023
Event: 2023 IEEE Conference on Virtual Reality and 3D User Interfaces Abstracts and Workshops, VRW 2023 - Shanghai, China
Duration: 25 Mar 2023 → 29 Mar 2023

Publication series

Name: Proceedings - 2023 IEEE Conference on Virtual Reality and 3D User Interfaces Abstracts and Workshops, VRW 2023

Conference

Conference: 2023 IEEE Conference on Virtual Reality and 3D User Interfaces Abstracts and Workshops, VRW 2023
Country/Territory: China
City: Shanghai
Period: 25/03/23 → 29/03/23

Keywords

  • Computer systems organization
  • Computing methodologies-Co-Speech Gestures
  • Computing methodologies-Virtual Agents
  • Transformer
