TY - GEN
T1 - WisdoM: Improving Multimodal Sentiment Analysis by Fusing Contextual World Knowledge
T2 - 32nd ACM International Conference on Multimedia, MM 2024
AU - Wang, Wenbin
AU - Ding, Liang
AU - Shen, Li
AU - Luo, Yong
AU - Hu, Han
AU - Tao, Dacheng
N1 - Publisher Copyright:
© 2024 ACM.
PY - 2024/10/28
Y1 - 2024/10/28
N2 - Multimodal Sentiment Analysis (MSA) focuses on leveraging multimodal signals for understanding human sentiment. Most existing works rely on superficial information and neglect contextual world knowledge (e.g., background information derived from, but beyond, the given image and text pairs), which restricts their ability to achieve better MSA. In this paper, we propose a plug-in framework named WisdoM that leverages contextual world knowledge induced from large vision-language models (LVLMs) for enhanced MSA. WisdoM utilizes LVLMs to comprehensively analyze both images and their corresponding texts while generating pertinent context. In addition, to reduce noise in the context, we design a training-free contextual fusion mechanism. We evaluate WisdoM on both aspect-level and sentence-level MSA tasks using the Twitter2015, Twitter2017, and MSED datasets. Experiments on these three MSA benchmarks with several advanced LVLMs show that our approach brings consistent and significant improvements (up to +6.3% F1 score). Code is available at https://github.com/DreamMr/WisdoM.
AB - Multimodal Sentiment Analysis (MSA) focuses on leveraging multimodal signals for understanding human sentiment. Most existing works rely on superficial information and neglect contextual world knowledge (e.g., background information derived from, but beyond, the given image and text pairs), which restricts their ability to achieve better MSA. In this paper, we propose a plug-in framework named WisdoM that leverages contextual world knowledge induced from large vision-language models (LVLMs) for enhanced MSA. WisdoM utilizes LVLMs to comprehensively analyze both images and their corresponding texts while generating pertinent context. In addition, to reduce noise in the context, we design a training-free contextual fusion mechanism. We evaluate WisdoM on both aspect-level and sentence-level MSA tasks using the Twitter2015, Twitter2017, and MSED datasets. Experiments on these three MSA benchmarks with several advanced LVLMs show that our approach brings consistent and significant improvements (up to +6.3% F1 score). Code is available at https://github.com/DreamMr/WisdoM.
KW - contextual fusion
KW - contextual world knowledge
KW - large vision-language model
KW - multimodal sentiment analysis
UR - http://www.scopus.com/inward/record.url?scp=85209826235&partnerID=8YFLogxK
U2 - 10.1145/3664647.3681403
DO - 10.1145/3664647.3681403
M3 - Conference contribution
AN - SCOPUS:85209826235
T3 - MM 2024 - Proceedings of the 32nd ACM International Conference on Multimedia
SP - 2282
EP - 2291
BT - MM 2024 - Proceedings of the 32nd ACM International Conference on Multimedia
PB - Association for Computing Machinery, Inc.
Y2 - 28 October 2024 through 1 November 2024
ER -