Vision, Voice, and Text: Pioneering Zero-shot Multimodal LLMs for Sentiment-driven Investment

  • Su Tan
  • Chi Chiu So*
  • Yueyue Sun
  • Jun Min Wang
  • Wai Keung Anthony Loh
  • Siu Pang Yung

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

Abstract

In the rapidly evolving financial landscape, sentiment analysis has emerged as a critical tool for decoding market dynamics, yet traditional approaches remain confined to textual data, overlooking the rich multimodal cues embedded in audio and video. This paper unveils a pioneering zero-shot framework that harnesses Multimodal Large Language Models (MLLMs) to revolutionize sentiment-driven investment by integrating text, audio, and video modalities. We introduce a comprehensive suite of metrics to extract nuanced emotional signals, a self-consistent signal verification mechanism to enhance market prediction reliability, and a JSON schema for seamless automation. To validate this innovation, we curate the White House Press Briefing (WHPB) Video Benchmark Database, a novel dataset of 30 press briefings from January to July 2025, offering a robust testbed for multimodal analysis. Our extensive experiments demonstrate that the full-multimodal approach, leveraging text, audio, and video, outperforms text-only and text-audio baselines, achieving superior returns across diverse assets, including a remarkable 2,843.9% annualized return on the VIX. This work not only redefines financial sentiment analysis but also sets a transformative foundation for AI-driven investment strategies, empowering investors with unprecedented insights into market sentiment. Our WHPB database is available at https://github.com/sutan244/White-House-Press-Briefing-Video-Benchmark-Dataset-WHPB.
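
The abstract names a self-consistent signal verification mechanism and a JSON schema but does not describe either in detail. As a rough illustration only, self-consistency is commonly implemented by querying a model several times and keeping the majority answer. The sketch below assumes each response is a JSON object with a single "signal" field; the field name, the label set, the `min_agreement` threshold, and both function names are hypothetical and are not taken from the paper.

```python
import json
from collections import Counter

# Hypothetical label set; the paper's actual schema is not given in the abstract.
ALLOWED_SIGNALS = {"bullish", "bearish", "neutral"}


def parse_signal(raw):
    """Return the 'signal' field of one JSON response, or None if it is invalid."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(obj, dict):
        return None
    signal = obj.get("signal")
    if isinstance(signal, str) and signal in ALLOWED_SIGNALS:
        return signal
    return None


def self_consistent_signal(raw_responses, min_agreement=0.6):
    """Majority-vote over repeated responses; fall back to 'neutral' if agreement is low.

    Agreement is the share of ALL responses (including unparseable ones)
    that match the winning label, so malformed outputs lower confidence.
    """
    votes = [s for s in (parse_signal(r) for r in raw_responses) if s is not None]
    if not votes:
        return "neutral", 0.0
    label, count = Counter(votes).most_common(1)[0]
    agreement = count / len(raw_responses)
    if agreement < min_agreement:
        return "neutral", agreement
    return label, agreement
```

Under these assumptions, a caller would pass the raw text of several repeated model outputs for the same briefing and act on the returned label only when it is not the neutral fallback. Counting unparseable responses against agreement is a design choice made here for caution, not something the abstract specifies.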

Original language: English
Title of host publication: ICAIF 2025 - 6th ACM International Conference on AI in Finance
Publisher: Association for Computing Machinery, Inc
Pages: 960-968
Number of pages: 9
ISBN (Electronic): 9798400722202
Publication status: Published - 14 Nov 2025
Event: 6th ACM International Conference on AI in Finance, ICAIF 2025 - Singapore, Singapore
Duration: 15 Nov 2025 → 18 Nov 2025

Publication series

Name: ICAIF 2025 - 6th ACM International Conference on AI in Finance

Conference

Conference: 6th ACM International Conference on AI in Finance, ICAIF 2025
Country/Territory: Singapore
City: Singapore
Period: 15/11/25 → 18/11/25

Keywords

  • Multimodal Large Language Models (MLLMs)
  • Self-consistency
  • Sentiment Analysis
  • Zero-shot Prompting
