TY - JOUR
T1 - Cross-modal dynamic convolution for multi-modal emotion recognition
AU - Wen, Huanglu
AU - You, Shaodi
AU - Fu, Ying
N1 - Publisher Copyright:
© 2021 Elsevier Inc.
PY - 2021/7
Y1 - 2021/7
AB - Understanding human emotions requires information from multiple modalities, such as vocal, visual, and verbal cues. Since human emotion varies over time, the related information is usually represented as temporal sequences, and we need to identify both the emotion-related clues and the cross-modal interactions among them. However, emotion-related clues are sparse and misaligned in temporally unaligned sequences, making it hard for previous multi-modal emotion recognition methods to capture helpful cross-modal interactions. To this end, we present cross-modal dynamic convolution. To deal with this sparsity, cross-modal dynamic convolution models the temporal dimension locally, so it is not overwhelmed by unrelated information. Cross-modal dynamic convolution is easy to stack, which enables it to model long-range cross-modal temporal interactions. In addition, models with cross-modal dynamic convolution train more stably than models with cross-modal attention, opening up more possibilities in multi-modal sequential model design. Extensive experiments show that our method achieves performance competitive with previous works while being more efficient.
KW - Affective behavior
KW - Artificial neural networks
KW - Multi-modal temporal sequences
KW - Pattern recognition
UR - http://www.scopus.com/inward/record.url?scp=85108686372&partnerID=8YFLogxK
U2 - 10.1016/j.jvcir.2021.103178
DO - 10.1016/j.jvcir.2021.103178
M3 - Article
AN - SCOPUS:85108686372
SN - 1047-3203
VL - 78
JO - Journal of Visual Communication and Image Representation
JF - Journal of Visual Communication and Image Representation
M1 - 103178
ER -