A Multitask learning model for multimodal sarcasm, sentiment and emotion recognition in conversations

Yazhou Zhang; Jinglin Wang; Yaochen Liu; Lu Rong; Qian Zheng; Dawei Song; Prayag Tiwari; Jing Qin

doi:10.1016/j.inffus.2023.01.005

Yazhou Zhang, Jinglin Wang, Yaochen Liu, Lu Rong, Qian Zheng^*, Dawei Song, Prayag Tiwari, Jing Qin

^*Corresponding author for this work

School of Computer Science and Technology

Research output: Contribution to journal › Article › peer-review

57 Citations (Scopus)

Abstract

Sarcasm, sentiment and emotion are tightly coupled: recognizing one helps in understanding the others, which has made their joint recognition in conversation a focus of research in artificial intelligence (AI) and affective computing. Three main challenges exist: context dependency, multimodal fusion and multitask interaction. However, most existing works fail to explicitly leverage and model the relationships among related tasks. In this paper, we aim to address the three problems generically with a multimodal joint framework. We propose a multimodal multitask learning model based on the encoder–decoder architecture, termed M2Seq2Seq. At the heart of the encoder module are two attention mechanisms, i.e., intramodal (I_a) attention and intermodal (I_e) attention. I_a attention is designed to capture the contextual dependency between adjacent utterances, while I_e attention is designed to model multimodal interactions. For the decoder, we design two kinds of multitask learning (MTL) decoders, i.e., single-level and multilevel decoders, to explore their potential. More specifically, the core of the single-level decoder is a masked outer-modal (O_r) self-attention mechanism, whose main motivation is to explicitly model the interdependence among the tasks of sarcasm, sentiment and emotion recognition. The core of the multilevel decoder consists of shared gating and task-specific gating networks. Comprehensive experiments on four benchmark datasets, MUStARD, Memotion, CMU-MOSEI and MELD, show the effectiveness of M2Seq2Seq over state-of-the-art baselines (e.g., CM-GCN, A-MTL), with significant improvements of 1.9%, 2.0%, 5.0%, 0.8%, 4.3%, 3.1%, 2.8%, 1.0%, 1.7% and 2.8% in terms of Micro F1.
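The abstract reports its gains in Micro F1 without defining the metric. As a minimal sketch (not the authors' evaluation code), micro-averaged F1 pools true positives, false positives and false negatives over all classes and then computes F1 = 2TP / (2TP + FP + FN). The function name `micro_f1` and the toy sarcasm labels below are illustrative assumptions, not taken from the paper.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool TP/FP/FN over every class, then compute F1."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    tp = fp = fn = 0
    # Treat each class in turn as the "positive" class and accumulate counts.
    for label in set(y_true) | set(y_pred):
        for truth, pred in zip(y_true, y_pred):
            if pred == label and truth == label:
                tp += 1
            elif pred == label and truth != label:
                fp += 1
            elif pred != label and truth == label:
                fn += 1

    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical labels for one task (sarcasm detection), for illustration only.
y_true = ["sarcastic", "not", "sarcastic", "not"]
y_pred = ["sarcastic", "sarcastic", "sarcastic", "not"]
score = micro_f1(y_true, y_pred)
print(score)  # 0.75
```

For single-label classification, where each utterance gets exactly one class, this pooled score equals plain accuracy; the two differ only in multi-label settings.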

Original language	English
Pages (from-to)	282-301
Number of pages	20
Journal	Information Fusion
Volume	93
DOIs	https://doi.org/10.1016/j.inffus.2023.01.005
Publication status	Published - May 2023

Keywords

Affective computing
Emotion recognition
Multimodal sarcasm recognition
Multitask learning
Sentiment analysis

Cite this

Zhang, Y., Wang, J., Liu, Y., Rong, L., Zheng, Q., Song, D., Tiwari, P., & Qin, J. (2023). A Multitask learning model for multimodal sarcasm, sentiment and emotion recognition in conversations. Information Fusion, 93, 282-301. https://doi.org/10.1016/j.inffus.2023.01.005

@article{b3a013d940b64ad58afc10b71a5288ae,
  title = "A Multitask learning model for multimodal sarcasm, sentiment and emotion recognition in conversations",
  abstract = "Sarcasm, sentiment and emotion are tightly coupled with each other in that one helps the understanding of another, which makes the joint recognition of sarcasm, sentiment and emotion in conversation a focus in the research in artificial intelligence (AI) and affective computing. Three main challenges exist: Context dependency, multimodal fusion and multitask interaction. However, most of the existing works fail to explicitly leverage and model the relationships among related tasks. In this paper, we aim to generically address the three problems with a multimodal joint framework. We thus propose a multimodal multitask learning model based on the encoder–decoder architecture, termed M2Seq2Seq. At the heart of the encoder module are two attention mechanisms, i.e., intramodal (Ia) attention and intermodal (Ie) attention. Ia attention is designed to capture the contextual dependency between adjacent utterances, while Ie attention is designed to model multimodal interactions. In contrast, we design two kinds of multitask learning (MTL) decoders, i.e., single-level and multilevel decoders, to explore their potential. More specifically, the core of a single-level decoder is a masked outer-modal (Or) self-attention mechanism. The main motivation of Or attention is to explicitly model the interdependence among the tasks of sarcasm, sentiment and emotion recognition. The core of the multilevel decoder contains the shared gating and task-specific gating networks. Comprehensive experiments on four bench datasets, MUStARD, Memotion, CMU-MOSEI and MELD, prove the effectiveness of M2Seq2Seq over state-of-the-art baselines (e.g., CM-GCN, A-MTL) with significant improvements of 1.9%, 2.0%, 5.0%, 0.8%, 4.3%, 3.1%, 2.8%, 1.0%, 1.7% and 2.8% in terms of Micro F1.",
  keywords = "Affective computing, Emotion recognition, Multimodal sarcasm recognition, Multitask learning, Sentiment analysis",
  author = "Yazhou Zhang and Jinglin Wang and Yaochen Liu and Lu Rong and Qian Zheng and Dawei Song and Prayag Tiwari and Jing Qin",
  note = "Publisher Copyright: {\textcopyright} 2023 Elsevier B.V.",
  year = "2023",
  month = may,
  doi = "10.1016/j.inffus.2023.01.005",
  language = "English",
  volume = "93",
  pages = "282--301",
  journal = "Information Fusion",
  issn = "1566-2535",
  publisher = "Elsevier B.V.",
}

TY - JOUR
T1 - A Multitask learning model for multimodal sarcasm, sentiment and emotion recognition in conversations
AU - Zhang, Yazhou
AU - Wang, Jinglin
AU - Liu, Yaochen
AU - Rong, Lu
AU - Zheng, Qian
AU - Song, Dawei
AU - Tiwari, Prayag
AU - Qin, Jing
PY - 2023/5
Y1 - 2023/5

N2 - Sarcasm, sentiment and emotion are tightly coupled with each other in that one helps the understanding of another, which makes the joint recognition of sarcasm, sentiment and emotion in conversation a focus in the research in artificial intelligence (AI) and affective computing. Three main challenges exist: Context dependency, multimodal fusion and multitask interaction. However, most of the existing works fail to explicitly leverage and model the relationships among related tasks. In this paper, we aim to generically address the three problems with a multimodal joint framework. We thus propose a multimodal multitask learning model based on the encoder–decoder architecture, termed M2Seq2Seq. At the heart of the encoder module are two attention mechanisms, i.e., intramodal (Ia) attention and intermodal (Ie) attention. Ia attention is designed to capture the contextual dependency between adjacent utterances, while Ie attention is designed to model multimodal interactions. In contrast, we design two kinds of multitask learning (MTL) decoders, i.e., single-level and multilevel decoders, to explore their potential. More specifically, the core of a single-level decoder is a masked outer-modal (Or) self-attention mechanism. The main motivation of Or attention is to explicitly model the interdependence among the tasks of sarcasm, sentiment and emotion recognition. The core of the multilevel decoder contains the shared gating and task-specific gating networks. Comprehensive experiments on four bench datasets, MUStARD, Memotion, CMU-MOSEI and MELD, prove the effectiveness of M2Seq2Seq over state-of-the-art baselines (e.g., CM-GCN, A-MTL) with significant improvements of 1.9%, 2.0%, 5.0%, 0.8%, 4.3%, 3.1%, 2.8%, 1.0%, 1.7% and 2.8% in terms of Micro F1.

KW - Affective computing
KW - Emotion recognition
KW - Multimodal sarcasm recognition
KW - Multitask learning
KW - Sentiment analysis
UR - http://www.scopus.com/inward/record.url?scp=85146144725&partnerID=8YFLogxK
U2 - 10.1016/j.inffus.2023.01.005
DO - 10.1016/j.inffus.2023.01.005
M3 - Article
AN - SCOPUS:85146144725
SN - 1566-2535
VL - 93
SP - 282
EP - 301
JO - Information Fusion
JF - Information Fusion
ER -
