TY - JOUR
T1 - Cross-modal complementary network with hierarchical fusion for multimodal sentiment classification
AU - Peng, Cheng
AU - Zhang, Chunxia
AU - Xue, Xiaojun
AU - Gao, Jiameng
AU - Liang, Hongjian
AU - Niu, Zhengdong
N1 - Publisher Copyright:
© 2022 Tsinghua University Press.
PY - 2022/8/1
Y1 - 2022/8/1
N2 - Multimodal Sentiment Classification (MSC) uses multimodal data, such as images and texts, to identify users' sentiment polarities from the information they post on the Internet. MSC has attracted considerable attention because of its wide applications in social computing and opinion mining. However, improper correlation strategies can cause erroneous fusion, as texts and images that are unrelated to each other may be integrated. Moreover, simply concatenating features modality by modality, even when the modalities are truly correlated, cannot fully capture the features within and between modalities. To solve these problems, this paper proposes a Cross-Modal Complementary Network (CMCN) with hierarchical fusion for MSC. The CMCN is designed as a hierarchical structure with three key modules: a feature extraction module that extracts features from texts and images, a feature attention module that learns text and image attention features generated by an image-text correlation generator, and a cross-modal hierarchical fusion module that fuses features within and between modalities. The CMCN thus provides a hierarchical fusion framework that fully integrates different modal features and reduces the risk of integrating unrelated modal features. Extensive experimental results on three public datasets show that the proposed approach significantly outperforms state-of-the-art methods.
AB - Multimodal Sentiment Classification (MSC) uses multimodal data, such as images and texts, to identify users' sentiment polarities from the information they post on the Internet. MSC has attracted considerable attention because of its wide applications in social computing and opinion mining. However, improper correlation strategies can cause erroneous fusion, as texts and images that are unrelated to each other may be integrated. Moreover, simply concatenating features modality by modality, even when the modalities are truly correlated, cannot fully capture the features within and between modalities. To solve these problems, this paper proposes a Cross-Modal Complementary Network (CMCN) with hierarchical fusion for MSC. The CMCN is designed as a hierarchical structure with three key modules: a feature extraction module that extracts features from texts and images, a feature attention module that learns text and image attention features generated by an image-text correlation generator, and a cross-modal hierarchical fusion module that fuses features within and between modalities. The CMCN thus provides a hierarchical fusion framework that fully integrates different modal features and reduces the risk of integrating unrelated modal features. Extensive experimental results on three public datasets show that the proposed approach significantly outperforms state-of-the-art methods.
KW - Cross-Modal Complementary Network (CMCN)
KW - hierarchical fusion
KW - joint optimization
KW - multimodal fusion
KW - multimodal sentiment analysis
UR - http://www.scopus.com/inward/record.url?scp=85121696912&partnerID=8YFLogxK
U2 - 10.26599/TST.2021.9010055
DO - 10.26599/TST.2021.9010055
M3 - Article
AN - SCOPUS:85121696912
SN - 1007-0214
VL - 27
SP - 664
EP - 679
JO - Tsinghua Science and Technology
JF - Tsinghua Science and Technology
IS - 4
ER -