TY - GEN
T1 - Vision, Voice, and Text
T2 - 6th ACM International Conference on AI in Finance, ICAIF 2025
AU - Tan, Su
AU - So, Chi Chiu
AU - Sun, Yueyue
AU - Wang, Jun Min
AU - Loh, Wai Keung Anthony
AU - Yung, Siu Pang
N1 - Publisher Copyright:
© 2025 Copyright is held by the owner/author(s). Publication rights licensed to ACM.
PY - 2025/11/14
Y1 - 2025/11/14
N2 - In the rapidly evolving financial landscape, sentiment analysis has emerged as a critical tool for decoding market dynamics, yet traditional approaches remain confined to textual data, overlooking the rich multimodal cues embedded in audio and video. This paper unveils a pioneering zero-shot framework that harnesses Multimodal Large Language Models (MLLMs) to revolutionize sentiment-driven investment by integrating text, audio, and video modalities. We introduce a comprehensive suite of metrics to extract nuanced emotional signals, a self-consistent signal verification mechanism to enhance market prediction reliability, and a JSON schema for seamless automation. To validate this innovation, we curate the White House Press Briefing (WHPB) Video Benchmark Database, a novel dataset of 30 press briefings from January to July 2025, offering a robust testbed for multimodal analysis. Our extensive experiments demonstrate that the full-multimodal approach, leveraging text, audio, and video, outperforms text-only and text-audio baselines, achieving superior returns across diverse assets, including a remarkable 2,843.9% annualized return on the VIX. This work not only redefines financial sentiment analysis but also sets a transformative foundation for AI-driven investment strategies, empowering investors with unprecedented insights into market sentiment. Our WHPB database is available at https://github.com/sutan244/White-House-Press-Briefing-Video-Benchmark-Dataset-WHPB.
AB - In the rapidly evolving financial landscape, sentiment analysis has emerged as a critical tool for decoding market dynamics, yet traditional approaches remain confined to textual data, overlooking the rich multimodal cues embedded in audio and video. This paper unveils a pioneering zero-shot framework that harnesses Multimodal Large Language Models (MLLMs) to revolutionize sentiment-driven investment by integrating text, audio, and video modalities. We introduce a comprehensive suite of metrics to extract nuanced emotional signals, a self-consistent signal verification mechanism to enhance market prediction reliability, and a JSON schema for seamless automation. To validate this innovation, we curate the White House Press Briefing (WHPB) Video Benchmark Database, a novel dataset of 30 press briefings from January to July 2025, offering a robust testbed for multimodal analysis. Our extensive experiments demonstrate that the full-multimodal approach, leveraging text, audio, and video, outperforms text-only and text-audio baselines, achieving superior returns across diverse assets, including a remarkable 2,843.9% annualized return on the VIX. This work not only redefines financial sentiment analysis but also sets a transformative foundation for AI-driven investment strategies, empowering investors with unprecedented insights into market sentiment. Our WHPB database is available at https://github.com/sutan244/White-House-Press-Briefing-Video-Benchmark-Dataset-WHPB.
KW - Multimodal Large Language Models (MLLMs)
KW - Self-consistency
KW - Sentiment Analysis
KW - Zero-shot Prompting
UR - https://www.scopus.com/pages/publications/105023054020
U2 - 10.1145/3768292.3770368
DO - 10.1145/3768292.3770368
M3 - Conference contribution
AN - SCOPUS:105023054020
T3 - ICAIF 2025 - 6th ACM International Conference on AI in Finance
SP - 960
EP - 968
BT - ICAIF 2025 - 6th ACM International Conference on AI in Finance
PB - Association for Computing Machinery, Inc
Y2 - 15 November 2025 through 18 November 2025
ER -