TY - JOUR
T1 - Cross-modal context-gated convolution for multi-modal sentiment analysis
AU - Wen, Huanglu
AU - You, Shaodi
AU - Fu, Ying
N1 - Publisher Copyright:
© 2021 Elsevier B.V.
PY - 2021/6
Y1 - 2021/6
AB - When inferring sentiments, relying on verbal clues alone is problematic because of their ambiguity. Adding related vocal and visual contexts as complements to verbal clues can help. To infer sentiments from multi-modal temporal sequences, we need to identify both sentiment-related clues and their cross-modal interactions. However, sentiment-related behaviors in different modalities may not occur at the same time, and these behaviors and their interactions are sparse in time, making it hard to infer the correct sentiments. Moreover, unaligned sequences from sensors have varying sampling rates, which amplifies the misalignment and sparsity mentioned above. While most previous multi-modal sentiment analysis works focus only on word-aligned sequences, we propose cross-modal context-gated convolution for unaligned sequences. Cross-modal context-gated convolution captures local cross-modal interactions, handling the misalignment while reducing the effect of unrelated information. It introduces the concept of a cross-modal context gate, enabling it to capture useful cross-modal interactions more effectively, and it brings more possibilities to layer design for multi-modal sequential modeling. Experiments on multi-modal sentiment analysis datasets under both word-aligned and unaligned conditions show the validity of our approach.
KW - Affective behavior
KW - Artificial neural networks
KW - Multi-modal temporal sequences
KW - Pattern recognition
UR - http://www.scopus.com/inward/record.url?scp=85103944828&partnerID=8YFLogxK
DO - 10.1016/j.patrec.2021.03.025
M3 - Article
AN - SCOPUS:85103944828
SN - 0167-8655
VL - 146
SP - 252
EP - 259
JO - Pattern Recognition Letters
JF - Pattern Recognition Letters
ER -