A Multitask learning model for multimodal sarcasm, sentiment and emotion recognition in conversations

Yazhou Zhang, Jinglin Wang, Yaochen Liu, Lu Rong, Qian Zheng*, Dawei Song, Prayag Tiwari, Jing Qin

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

44 Citations (Scopus)

Abstract

Sarcasm, sentiment and emotion are tightly coupled with one another, in that each helps the understanding of the others, which makes their joint recognition in conversation a focal research topic in artificial intelligence (AI) and affective computing. Three main challenges exist: context dependency, multimodal fusion and multitask interaction. However, most existing works fail to explicitly leverage and model the relationships among related tasks. In this paper, we aim to generically address the three problems with a multimodal joint framework. We thus propose a multimodal multitask learning model based on the encoder–decoder architecture, termed M2Seq2Seq. At the heart of the encoder module are two attention mechanisms, i.e., intramodal (Ia) attention and intermodal (Ie) attention. Ia attention is designed to capture the contextual dependency between adjacent utterances, while Ie attention is designed to model multimodal interactions. On the decoder side, we design two kinds of multitask learning (MTL) decoders, i.e., single-level and multilevel decoders, to explore their potential. More specifically, the core of the single-level decoder is a masked outer-modal (Or) self-attention mechanism, whose main motivation is to explicitly model the interdependence among the tasks of sarcasm, sentiment and emotion recognition. The core of the multilevel decoder contains shared gating and task-specific gating networks. Comprehensive experiments on four benchmark datasets, MUStARD, Memotion, CMU-MOSEI and MELD, demonstrate the effectiveness of M2Seq2Seq over state-of-the-art baselines (e.g., CM-GCN, A-MTL), with significant improvements of 1.9%, 2.0%, 5.0%, 0.8%, 4.3%, 3.1%, 2.8%, 1.0%, 1.7% and 2.8% in terms of Micro F1.
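The abstract describes a multilevel MTL decoder built from a shared gating network and task-specific gating networks feeding three task heads. The sketch below is an illustrative reading of that idea, not the authors' implementation; all module names, dimensions and class counts are assumptions for demonstration.

```python
# Minimal sketch (assumed design, not the M2Seq2Seq code) of a multitask decoder
# with one shared gate and per-task gates over a fused multimodal representation.
import torch
import torch.nn as nn


class GatedMultitaskDecoder(nn.Module):
    """Routes a fused utterance representation to sarcasm, sentiment and emotion
    heads through a shared gate followed by task-specific gates."""

    def __init__(self, hidden_dim=256, num_sentiments=3, num_emotions=7):
        super().__init__()
        # Shared gating network: controls what all tasks see.
        self.shared_gate = nn.Sequential(nn.Linear(hidden_dim, hidden_dim), nn.Sigmoid())
        # Task-specific gating networks: each task re-weights the shared feature.
        self.task_gates = nn.ModuleDict({
            task: nn.Sequential(nn.Linear(hidden_dim, hidden_dim), nn.Sigmoid())
            for task in ("sarcasm", "sentiment", "emotion")
        })
        # Per-task classification heads (class counts are placeholders).
        self.heads = nn.ModuleDict({
            "sarcasm": nn.Linear(hidden_dim, 2),
            "sentiment": nn.Linear(hidden_dim, num_sentiments),
            "emotion": nn.Linear(hidden_dim, num_emotions),
        })

    def forward(self, fused):
        # fused: (batch, num_utterances, hidden_dim) from the multimodal encoder.
        shared = self.shared_gate(fused) * fused
        logits = {}
        for task, gate in self.task_gates.items():
            task_feat = gate(shared) * shared
            logits[task] = self.heads[task](task_feat)
        return logits


if __name__ == "__main__":
    decoder = GatedMultitaskDecoder()
    fused_utterances = torch.randn(4, 10, 256)  # 4 dialogues, 10 utterances each
    outputs = decoder(fused_utterances)
    print({task: tuple(t.shape) for task, t in outputs.items()})
```

The gating-then-head structure mirrors the shared versus task-specific split named in the abstract; the actual model additionally couples the tasks through masked outer (Or) self-attention in the single-level decoder, which is not reproduced here.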

Original language: English
Pages (from-to): 282-301
Number of pages: 20
Journal: Information Fusion
Volume: 93
DOI
Publication status: Published - May 2023
