Abstract
Understanding human emotions requires information from several modalities, such as vocal, visual, and verbal signals. Because human emotion varies over time, this information is usually represented as temporal sequences, and a model must identify both the emotion-related clues in each sequence and the cross-modal interactions among them. However, emotion-related clues are sparse and misaligned in temporally unaligned sequences, which makes it hard for previous multi-modal emotion recognition methods to capture helpful cross-modal interactions. To this end, we present cross-modal dynamic convolution. To deal with sparsity, cross-modal dynamic convolution models the temporal dimension locally, so that it is not overwhelmed by unrelated information. It is also easy to stack, which enables it to model long-range cross-modal temporal interactions. In addition, models built with cross-modal dynamic convolution are more stable during training than those built with cross-modal attention, which opens up more possibilities for designing multi-modal sequential models. Extensive experiments show that our method achieves competitive performance compared with previous work while being more efficient.
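The abstract does not spell out the operator, so the following is only an illustrative sketch of one way a locally windowed, cross-modal dynamic convolution could look, not the authors' implementation. Everything in it is an assumption made for illustration: the function name `cross_modal_dynamic_conv`, the projection matrix `w_kernel`, the nearest-neighbour resampling used to bring the unaligned source sequence to the target length, and the softmax normalization of the predicted kernels.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_modal_dynamic_conv(target, source, w_kernel, kernel_size=3):
    """Hypothetical sketch: the source modality predicts a local kernel
    for every time step of the target modality.

    target:   (T_t, d)   sequence being convolved (e.g. verbal features)
    source:   (T_s, d_s) sequence that generates the kernels (e.g. vocal)
    w_kernel: (d_s, kernel_size) projection, assumed to be learned
    """
    assert kernel_size % 2 == 1, "this sketch assumes an odd kernel size"
    t_len = target.shape[0]
    s_len = source.shape[0]

    # Assumption: handle unaligned lengths by nearest-neighbour resampling
    # of the source sequence onto the target time axis.
    idx = np.round(np.linspace(0, s_len - 1, t_len)).astype(int)
    source_aligned = source[idx]                            # (T_t, d_s)

    # One kernel per target time step, normalized over the local window.
    kernels = softmax(source_aligned @ w_kernel, axis=-1)   # (T_t, k)

    # Gather the zero-padded local window around every target time step.
    pad = kernel_size // 2
    padded = np.pad(target, ((pad, pad), (0, 0)))
    windows = np.stack(
        [padded[i:i + t_len] for i in range(kernel_size)], axis=1
    )                                                       # (T_t, k, d)

    # Weighted sum over the window: only nearby time steps contribute.
    return np.einsum("tk,tkd->td", kernels, windows)


rng = np.random.default_rng(0)
verbal = rng.normal(size=(6, 4))    # 6 time steps, 4 features
vocal = rng.normal(size=(10, 5))    # 10 time steps, 5 features (unaligned)
w = rng.normal(size=(5, 3))
fused = cross_modal_dynamic_conv(verbal, vocal, w)
print(fused.shape)
```

Because each output step only sees a window of `kernel_size` neighbours, unrelated distant frames cannot dominate it. Stacking widens the view: n such layers with kernel size k cover n(k - 1) + 1 time steps, which is how a local operator of this kind can still reach long-range interactions.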
| Original language | English |
|---|---|
| Article number | 103178 |
| Journal | Journal of Visual Communication and Image Representation |
| Volume | 78 |
| Publication status | Published - Jul 2021 |
Keywords
- Affective behavior
- Artificial neural networks
- Multi-modal temporal sequences
- Pattern recognition