What makes the difference? An empirical comparison of fusion strategies for multimodal language analysis

Dimitris Gkoumas; Qiuchi Li; Christina Lioma; Yijun Yu; Dawei Song

doi:10.1016/j.inffus.2020.09.005

Dimitris Gkoumas^*, Qiuchi Li, Christina Lioma, Yijun Yu, Dawei Song

^*Corresponding author for this work

School of Computer Science

Research output: Contribution to journal › Article › peer-review

72 Citations (Scopus)

Abstract

Multimodal video sentiment analysis is a rapidly growing area. It combines verbal (i.e., linguistic) and non-verbal modalities (i.e., visual, acoustic) to predict the sentiment of utterances. A recent trend has been geared towards different modality fusion models utilizing various attention, memory and recurrent components. However, there lacks a systematic investigation on how these different components contribute to solving the problem as well as their limitations. This paper aims to fill the gap, marking the following key innovations. We present the first large-scale and comprehensive empirical comparison of eleven state-of-the-art (SOTA) modality fusion approaches in two video sentiment analysis tasks, with three SOTA benchmark corpora. An in-depth analysis of the results shows that the attention mechanisms are the most effective for modelling crossmodal interactions, yet they are computationally expensive. Second, additional levels of crossmodal interaction decrease performance. Third, positive sentiment utterances are the most challenging cases for all approaches. Finally, integrating context and utilizing the linguistic modality as a pivot for non-verbal modalities improve performance. We expect that the findings would provide helpful insights and guidance to the development of more effective modality fusion models.

Original language	English
Pages (from-to)	184-197
Number of pages	14
Journal	Information Fusion
Volume	66
DOI	https://doi.org/10.1016/j.inffus.2020.09.005
Publication status	Published - Feb 2021

Cite this

Gkoumas, D., Li, Q., Lioma, C., Yu, Y., & Song, D. (2021). What makes the difference? An empirical comparison of fusion strategies for multimodal language analysis. Information Fusion, 66, 184-197. https://doi.org/10.1016/j.inffus.2020.09.005

@article{bb0a773042274247ac06a2ba1e260f3f,

title = "What makes the difference? An empirical comparison of fusion strategies for multimodal language analysis",

abstract = "Multimodal video sentiment analysis is a rapidly growing area. It combines verbal (i.e., linguistic) and non-verbal modalities (i.e., visual, acoustic) to predict the sentiment of utterances. A recent trend has been geared towards different modality fusion models utilizing various attention, memory and recurrent components. However, there lacks a systematic investigation on how these different components contribute to solving the problem as well as their limitations. This paper aims to fill the gap, marking the following key innovations. We present the first large-scale and comprehensive empirical comparison of eleven state-of-the-art (SOTA) modality fusion approaches in two video sentiment analysis tasks, with three SOTA benchmark corpora. An in-depth analysis of the results shows that the attention mechanisms are the most effective for modelling crossmodal interactions, yet they are computationally expensive. Second, additional levels of crossmodal interaction decrease performance. Third, positive sentiment utterances are the most challenging cases for all approaches. Finally, integrating context and utilizing the linguistic modality as a pivot for non-verbal modalities improve performance. We expect that the findings would provide helpful insights and guidance to the development of more effective modality fusion models.",

keywords = "Emotion recognition, Multimodal human language understanding, Reproducibility in multimodal machine learning, Video sentiment analysis",

author = "Dimitris Gkoumas and Qiuchi Li and Christina Lioma and Yijun Yu and Dawei Song",

note = "Publisher Copyright: {\textcopyright} 2020 Elsevier B.V.",

year = "2021",

month = feb,

doi = "10.1016/j.inffus.2020.09.005",

language = "English",

volume = "66",

pages = "184--197",

journal = "Information Fusion",

issn = "1566-2535",

publisher = "Elsevier B.V.",

}

TY - JOUR

T1 - What makes the difference? An empirical comparison of fusion strategies for multimodal language analysis

AU - Gkoumas, Dimitris

AU - Li, Qiuchi

AU - Lioma, Christina

AU - Yu, Yijun

AU - Song, Dawei

PY - 2021/2

Y1 - 2021/2

AB - Multimodal video sentiment analysis is a rapidly growing area. It combines verbal (i.e., linguistic) and non-verbal modalities (i.e., visual, acoustic) to predict the sentiment of utterances. A recent trend has been geared towards different modality fusion models utilizing various attention, memory and recurrent components. However, there lacks a systematic investigation on how these different components contribute to solving the problem as well as their limitations. This paper aims to fill the gap, marking the following key innovations. We present the first large-scale and comprehensive empirical comparison of eleven state-of-the-art (SOTA) modality fusion approaches in two video sentiment analysis tasks, with three SOTA benchmark corpora. An in-depth analysis of the results shows that the attention mechanisms are the most effective for modelling crossmodal interactions, yet they are computationally expensive. Second, additional levels of crossmodal interaction decrease performance. Third, positive sentiment utterances are the most challenging cases for all approaches. Finally, integrating context and utilizing the linguistic modality as a pivot for non-verbal modalities improve performance. We expect that the findings would provide helpful insights and guidance to the development of more effective modality fusion models.

KW - Emotion recognition

KW - Multimodal human language understanding

KW - Reproducibility in multimodal machine learning

KW - Video sentiment analysis

UR - http://www.scopus.com/inward/record.url?scp=85091217348&partnerID=8YFLogxK

U2 - 10.1016/j.inffus.2020.09.005

DO - 10.1016/j.inffus.2020.09.005

M3 - Article

AN - SCOPUS:85091217348

SN - 1566-2535

VL - 66

SP - 184

EP - 197

JO - Information Fusion

JF - Information Fusion

ER -
