WisdoM: Improving Multimodal Sentiment Analysis by Fusing Contextual World Knowledge

Wenbin Wang, Liang Ding, Li Shen, Yong Luo*, Han Hu, Dacheng Tao

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceedingConference contributionpeer-review

5 Citations (Scopus)

Abstract

Multimodal Sentiment Analysis (MSA) focuses on leveraging multimodal signals for understanding human sentiment. Most of the existing works rely on superficial information, neglecting the incorporation of contextual world knowledge (e.g., background information derived from but beyond the given image and text pairs), thereby restricting their ability to achieve better multimodal sentiment analysis (MSA). In this paper, we propose a plug-in framework named WisdoM, to leverage the contextual world knowledge induced from the large vision-language models (LVLMs) for enhanced MSA. WisdoM utilizes LVLMs to comprehensively analyze both images and corresponding texts, simultaneously generating pertinent context. Besides, to reduce the noise in the context, we design a training-free contextual fusion mechanism. We evaluate our WisdoM in both the aspect-level and sentence-level MSA tasks on the Twitter2015, Twitter2017, and MSED datasets. Experiments on three MSA benchmarks upon several advanced LVLMs, show that our approach brings consistent and significant improvements (up to +6.3% F1 score). Code is available at https://github.com/DreamMr/WisdoM.

Original languageEnglish
Title of host publicationMM 2024 - Proceedings of the 32nd ACM International Conference on Multimedia
PublisherAssociation for Computing Machinery, Inc
Pages2282-2291
Number of pages10
ISBN (Electronic)9798400706868
DOIs
Publication statusPublished - 28 Oct 2024
Event32nd ACM International Conference on Multimedia, MM 2024 - Melbourne, Australia
Duration: 28 Oct 20241 Nov 2024

Publication series

NameMM 2024 - Proceedings of the 32nd ACM International Conference on Multimedia

Conference

Conference32nd ACM International Conference on Multimedia, MM 2024
Country/TerritoryAustralia
CityMelbourne
Period28/10/241/11/24

Keywords

  • contextual fusion
  • contextual world knowledge
  • large vision-language model
  • multimodal sentiment analysis

Cite this