M3GAT: A Multi-modal, Multi-task Interactive Graph Attention Network for Conversational Sentiment Analysis and Emotion Recognition

Yazhou Zhang, Ao Jia, Bo Wang, Peng Zhang, Dongming Zhao, Pu Li, Yuexian Hou, Xiaojia Jin, Dawei Song*, Jing Qin*

*Corresponding authors for this work

Research output: Contribution to journal › Article › peer-review

27 Citations (Scopus)

Abstract

Sentiment and emotion, corresponding to long-term and short-lived human feelings respectively, are closely linked, making sentiment analysis and emotion recognition two interdependent tasks in natural language processing (NLP). Each task often leverages knowledge shared by the other and performs better when the two are solved in a joint learning paradigm. Conversational context dependency, multi-modal interaction, and multi-task correlation are three key factors that contribute to this joint paradigm. However, none of the recent approaches has considered all three in a unified framework. To fill this gap, we propose a multi-modal, multi-task interactive graph attention network, termed M3GAT, to simultaneously solve the three problems. At the heart of the model is an interactive conversation graph layer with three core sub-modules: (1) a local-global context connection for modeling both local and global conversational context, (2) a cross-modal connection for learning multi-modal complementarity, and (3) a cross-task connection for capturing the correlation between the two tasks. Comprehensive experiments on three benchmark datasets, MELD, MEISD, and MSED, show that M3GAT outperforms state-of-the-art baselines by margins of 1.88%, 5.37%, and 0.19% for sentiment analysis, and 1.99%, 3.65%, and 0.13% for emotion recognition, respectively. We also demonstrate the superiority of multi-task learning over single-task frameworks.
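
To make the abstract's three connection types concrete, the sketch below shows, in plain PyTorch, how such a heterogeneous conversation graph could be assembled and attended over. This is a minimal illustration, not the authors' released implementation: the node indexing scheme, the function build_edge_mask, the window parameter, and the single-head MaskedGATLayer are all simplifying assumptions introduced here for exposition.

import torch
import torch.nn as nn
import torch.nn.functional as F


def build_edge_mask(n_utts, n_modalities=3, n_tasks=2, window=2):
    """Boolean adjacency over nodes indexed as (utterance, modality, task).

    Three edge families, mirroring the abstract:
      1. local-global context: same modality and task, utterances within
         `window` of each other (local), plus edges to the first utterance
         as a crude global anchor;
      2. cross-modal: same utterance and task, different modalities;
      3. cross-task: same utterance and modality, different tasks.
    """
    n = n_utts * n_modalities * n_tasks
    idx = torch.arange(n)
    utt = idx // (n_modalities * n_tasks)
    mod = (idx // n_tasks) % n_modalities
    task = idx % n_tasks

    same_mod = mod.unsqueeze(0) == mod.unsqueeze(1)
    same_task = task.unsqueeze(0) == task.unsqueeze(1)
    same_utt = utt.unsqueeze(0) == utt.unsqueeze(1)
    dist = (utt.unsqueeze(0) - utt.unsqueeze(1)).abs()

    local = (dist <= window) & same_mod & same_task
    global_ = ((utt.unsqueeze(0) == 0) | (utt.unsqueeze(1) == 0)) & same_mod & same_task
    cross_modal = same_utt & same_task & ~same_mod
    cross_task = same_utt & same_mod & ~same_task
    return local | global_ | cross_modal | cross_task


class MaskedGATLayer(nn.Module):
    """Single-head graph attention restricted to a boolean edge mask."""

    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim, bias=False)
        self.attn = nn.Linear(2 * dim, 1, bias=False)

    def forward(self, x, mask):  # x: (n_nodes, dim), mask: (n_nodes, n_nodes) bool
        h = self.proj(x)
        n = h.size(0)
        # Concatenate every (source, target) pair and score it.
        pairs = torch.cat(
            [h.unsqueeze(1).expand(n, n, -1), h.unsqueeze(0).expand(n, n, -1)],
            dim=-1,
        )
        scores = F.leaky_relu(self.attn(pairs).squeeze(-1))
        # Attention flows only along edges permitted by the mask;
        # self-loops from the local-context edges keep every row finite.
        scores = scores.masked_fill(~mask, float("-inf"))
        return torch.matmul(F.softmax(scores, dim=-1), h)


# Usage: 4 utterances x 3 modalities x 2 tasks = 24 graph nodes.
mask = build_edge_mask(n_utts=4)
layer = MaskedGATLayer(dim=16)
out = layer(torch.randn(24, 16), mask)
print(out.shape)  # torch.Size([24, 16])

A real implementation would stack several such layers, use multi-head attention, and feed the task-specific node states into separate sentiment and emotion classifiers; the sketch only shows how the three edge types can coexist in one attention layer.
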

Original language: English
Article number: 13
Journal: ACM Transactions on Information Systems
Volume: 42
Issue number: 1
DOI: 10.1145/3593583
Publication status: Published - 21 Aug 2023

Keywords

  • Multi-modal sentiment analysis
  • emotion recognition
  • graph neural network
  • multi-task learning


Cite this

Zhang, Y., Jia, A., Wang, B., Zhang, P., Zhao, D., Li, P., Hou, Y., Jin, X., Song, D., & Qin, J. (2023). M3GAT: A Multi-modal, Multi-task Interactive Graph Attention Network for Conversational Sentiment Analysis and Emotion Recognition. ACM Transactions on Information Systems, 42(1), Article 13. https://doi.org/10.1145/3593583