MFGCN: Multimodal fusion graph convolutional network for speech emotion recognition

Xin Qi, Yujun Wen*, Pengzhou Zhang, Heyan Huang

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

4 Citations (Scopus)

Abstract

Speech emotion recognition (SER) is challenging owing to the complexity of emotional representation. Hence, this article focuses on multimodal speech emotion recognition, which analyzes the speaker's emotional state via audio signals and textual content. Existing multimodal approaches use sequential networks to capture temporal dependencies in the feature sequences, ignoring the underlying relations in the acoustic and textual modalities. Moreover, current feature-level and decision-level fusion methods have unresolved limitations. Therefore, this paper develops a novel multimodal fusion graph convolutional network that comprehensively executes information interactions within and between the two modalities. Specifically, we construct intra-modal relations to excavate the intrinsic characteristics exclusive to each modality. For inter-modal fusion, a multi-perspective fusion mechanism is devised to integrate the complementary information between the two modalities. Extensive experiments on the IEMOCAP and RAVDESS datasets demonstrate that our approach achieves superior performance.
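
The abstract describes intra-modal graph construction within each modality followed by inter-modal fusion via graph convolution. The sketch below is only a minimal illustration of that general idea under assumed names and dimensions (GCNLayer, MultimodalFusionSketch, the adjacency construction, and the 64-dimensional features are all hypothetical); it is not the authors' MFGCN implementation or its multi-perspective fusion mechanism.

# Minimal illustrative sketch (not the authors' code): graph convolution over
# audio and text utterance nodes, with intra-modal edges inside each modality
# and inter-modal edges linking the two. All names and shapes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GCNLayer(nn.Module):
    """One standard graph convolution: H' = ReLU(D^-1/2 (A + I) D^-1/2 H W)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, h, adj):
        a_hat = adj + torch.eye(adj.size(0))      # add self-loops
        deg = a_hat.sum(dim=1)
        d_inv_sqrt = torch.diag(deg.pow(-0.5))
        a_norm = d_inv_sqrt @ a_hat @ d_inv_sqrt  # symmetric normalization
        return F.relu(a_norm @ self.linear(h))

class MultimodalFusionSketch(nn.Module):
    """Hypothetical fusion: intra-modal GCNs per modality, then a joint GCN
    over a graph whose off-diagonal blocks connect audio and text nodes."""
    def __init__(self, dim, n_classes):
        super().__init__()
        self.audio_gcn = GCNLayer(dim, dim)
        self.text_gcn = GCNLayer(dim, dim)
        self.fusion_gcn = GCNLayer(dim, dim)
        self.classifier = nn.Linear(dim, n_classes)

    def forward(self, h_audio, h_text, adj_audio, adj_text, adj_cross):
        # Intra-modal propagation within each modality's own graph.
        h_a = self.audio_gcn(h_audio, adj_audio)
        h_t = self.text_gcn(h_text, adj_text)
        # Inter-modal propagation over a joint graph: the cross block
        # (adj_cross) links audio nodes to their textual counterparts.
        n_a, n_t = h_a.size(0), h_t.size(0)
        adj = torch.zeros(n_a + n_t, n_a + n_t)
        adj[:n_a, :n_a] = adj_audio
        adj[n_a:, n_a:] = adj_text
        adj[:n_a, n_a:] = adj_cross
        adj[n_a:, :n_a] = adj_cross.t()
        h = self.fusion_gcn(torch.cat([h_a, h_t], dim=0), adj)
        # Utterance-level prediction from the fused audio-node representations.
        return self.classifier(h[:n_a])

# Usage with random features for 4 utterances per modality.
model = MultimodalFusionSketch(dim=64, n_classes=4)
ha, ht = torch.randn(4, 64), torch.randn(4, 64)
aa = at = torch.ones(4, 4)   # fully connected intra-modal graphs
cross = torch.eye(4)         # pair the i-th audio node with the i-th text node
logits = model(ha, ht, aa, at, cross)
print(logits.shape)          # torch.Size([4, 4])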

Original language: English
Article number: 128646
Journal: Neurocomputing
Volume: 611
DOIs
Publication status: Published - 1 Jan 2025

Keywords

  • Graph convolutional networks
  • Multimodal learning
  • Representation learning
  • Speech emotion recognition
