Cross-modal dynamic convolution for multi-modal emotion recognition

Huanglu Wen, Shaodi You, Ying Fu*

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

16 Citations (Scopus)

Abstract

Understanding human emotions requires information from different modalities such as the vocal, visual, and verbal. Since human emotion is time-varying, the related information is usually represented as temporal sequences, and we need to identify both the emotion-related clues and their cross-modal interactions within them. However, emotion-related clues are sparse and misaligned in temporally unaligned sequences, making it hard for previous multi-modal emotion recognition methods to capture helpful cross-modal interactions. To this end, we present cross-modal dynamic convolution. To deal with sparsity, cross-modal dynamic convolution models the temporal dimension locally, avoiding being overwhelmed by unrelated information. Cross-modal dynamic convolution is easy to stack, enabling it to model long-range cross-modal temporal interactions. Moreover, models built with cross-modal dynamic convolution are more stable during training than those with cross-modal attention, opening up more possibilities for multi-modal sequential model design. Extensive experiments show that our method achieves competitive performance compared to previous works while being more efficient.
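The abstract describes kernels that are predicted dynamically and applied locally along the temporal dimension, conditioned on another modality. The sketch below illustrates one plausible reading of that idea, in the style of lightweight/dynamic convolutions: a "query" modality predicts per-step depthwise kernels that are then applied to local windows of a second modality. All names, shapes, and design details (softmax-normalised kernels, equal sequence lengths after resampling, odd kernel size) are illustrative assumptions, not the paper's exact formulation.

```python
# Hypothetical sketch of a cross-modal dynamic convolution layer (PyTorch).
# Assumes both modalities are projected to the same feature dimension and
# resampled to the same length T; this is NOT the authors' reference code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossModalDynamicConv(nn.Module):
    def __init__(self, dim: int, kernel_size: int = 5, num_heads: int = 4):
        super().__init__()
        assert dim % num_heads == 0 and kernel_size % 2 == 1
        self.kernel_size = kernel_size
        self.num_heads = num_heads
        # Convolution kernels are predicted from the query modality at each step.
        self.kernel_proj = nn.Linear(dim, num_heads * kernel_size)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, query: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        """query, context: (batch, time, dim) sequences from two modalities."""
        B, T, D = context.shape
        H, K = self.num_heads, self.kernel_size

        # Per-step, per-head kernels from the query modality, softmax-normalised.
        kernels = F.softmax(self.kernel_proj(query).view(B, T, H, K), dim=-1)

        # Unfold the context modality into local temporal windows of length K.
        pad = (K - 1) // 2
        ctx = context.transpose(1, 2).unsqueeze(-1)             # (B, D, T, 1)
        windows = F.unfold(ctx, (K, 1), padding=(pad, 0))        # (B, D*K, T)
        windows = windows.view(B, D, K, T).permute(0, 3, 1, 2)   # (B, T, D, K)
        windows = windows.reshape(B, T, H, D // H, K)

        # Apply the dynamically predicted kernels locally along time.
        out = torch.einsum('bthdk,bthk->bthd', windows, kernels)
        return self.out_proj(out.reshape(B, T, D))


# Usage sketch: audio features attend (convolutionally) over text features.
layer = CrossModalDynamicConv(dim=64)
audio = torch.randn(2, 50, 64)
text = torch.randn(2, 50, 64)
fused = layer(audio, text)   # (2, 50, 64)
```

Because each output step only mixes a local window of the other modality, such a layer stays linear in sequence length; stacking several layers would be the natural way to reach the long-range cross-modal interactions the abstract mentions.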

Original language: English
Article number: 103178
Journal: Journal of Visual Communication and Image Representation
Volume: 78
Publication status: Published - Jul 2021

Keywords

  • Affective behavior
  • Artificial neural networks
  • Multi-modal temporal sequences
  • Pattern recognition
