TY - JOUR
T1 - Cross-modal complementary network with hierarchical fusion for multimodal sentiment classification
AU - Peng, Cheng
AU - Zhang, Chunxia
AU - Xue, Xiaojun
AU - Gao, Jiameng
AU - Liang, Hongjian
AU - Niu, Zhengdong
N1 - Publisher Copyright:
© 2022 Tsinghua University Press.
PY - 2022/8/1
Y1 - 2022/8/1
N2 - Multimodal Sentiment Classification (MSC) uses multimodal data, such as images and texts, to identify users' sentiment polarities from the information they post on the Internet. MSC has attracted considerable attention because of its wide applications in social computing and opinion mining. However, improper correlation strategies can cause erroneous fusion, as texts and images that are unrelated to each other may be integrated. Moreover, simply concatenating features modality by modality, even when the modalities are truly correlated, cannot fully capture the features within and between modalities. To solve these problems, this paper proposes a Cross-Modal Complementary Network (CMCN) with hierarchical fusion for MSC. The CMCN is designed as a hierarchical structure with three key modules: a feature extraction module that extracts features from texts and images, a feature attention module that learns text and image attention features generated by an image-text correlation generator, and a cross-modal hierarchical fusion module that fuses features within and between modalities. The CMCN thus provides a hierarchical fusion framework that fully integrates different modal features and reduces the risk of integrating unrelated modal features. Extensive experimental results on three public datasets show that the proposed approach significantly outperforms state-of-the-art methods.
AB - Multimodal Sentiment Classification (MSC) uses multimodal data, such as images and texts, to identify users' sentiment polarities from the information they post on the Internet. MSC has attracted considerable attention because of its wide applications in social computing and opinion mining. However, improper correlation strategies can cause erroneous fusion, as texts and images that are unrelated to each other may be integrated. Moreover, simply concatenating features modality by modality, even when the modalities are truly correlated, cannot fully capture the features within and between modalities. To solve these problems, this paper proposes a Cross-Modal Complementary Network (CMCN) with hierarchical fusion for MSC. The CMCN is designed as a hierarchical structure with three key modules: a feature extraction module that extracts features from texts and images, a feature attention module that learns text and image attention features generated by an image-text correlation generator, and a cross-modal hierarchical fusion module that fuses features within and between modalities. The CMCN thus provides a hierarchical fusion framework that fully integrates different modal features and reduces the risk of integrating unrelated modal features. Extensive experimental results on three public datasets show that the proposed approach significantly outperforms state-of-the-art methods.
KW - Cross-Modal Complementary Network (CMCN)
KW - hierarchical fusion
KW - joint optimization
KW - multimodal fusion
KW - multimodal sentiment analysis
UR - http://www.scopus.com/inward/record.url?scp=85121696912&partnerID=8YFLogxK
U2 - 10.26599/TST.2021.9010055
DO - 10.26599/TST.2021.9010055
M3 - Article
AN - SCOPUS:85121696912
SN - 1007-0214
VL - 27
SP - 664
EP - 679
JO - Tsinghua Science and Technology
JF - Tsinghua Science and Technology
IS - 4
ER -