From Subtle Hints to Grand Expressions - Mastering Fine-grained Emotions with Dynamic Multimodal Analysis

  • Qinfu Xu
  • Liyuan Pan*
  • Shaozu Yuan
  • Yiwei Wei
  • Chunlei Wu

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Multimodal Emotion Analysis (MEA) plays a crucial role in extracting and understanding emotional insights from diverse data sources, including text, video, and audio. However, existing methods often overlook the temporal asynchronism among multimodal components and obtain insufficient representations of fine-grained emotional expressions. In light of this, we propose a unified emotion reasoning model, EmoChat, which enhances multimodal emotion analysis by dynamically generating emotion-related tokens and fine-grained expression information through facial action modeling. To incorporate expression semantics, we design the AU Agent, a lightweight facial expression extractor that provides LLMs with fine-grained facial knowledge for reasoning. In addition, we propose the Correlation Aggregator to alleviate the correlation differences between acoustic features and textual content. Our method thus decouples the audio and vision modalities, allowing efficient token-level mining of emotion cues in misaligned multimodal input while maintaining semantic consistency across different languages. Experiments on public benchmark datasets demonstrate the superiority of the proposed EmoChat over state-of-the-art methods.

Original language: English
Title of host publication: MM 2025 - Proceedings of the 33rd ACM International Conference on Multimedia, Co-Located with MM 2025
Publisher: Association for Computing Machinery, Inc
Pages: 5499-5508
Number of pages: 10
ISBN (Electronic): 9798400720352
DOIs
Publication status: Published - 27 Oct 2025
Event: 33rd ACM International Conference on Multimedia, MM 2025 - Dublin, Ireland
Duration: 27 Oct 2025 - 31 Oct 2025

Publication series

Name: MM 2025 - Proceedings of the 33rd ACM International Conference on Multimedia, Co-Located with MM 2025

Conference

Conference: 33rd ACM International Conference on Multimedia, MM 2025
Country/Territory: Ireland
City: Dublin
Period: 27/10/25 - 31/10/25

Keywords

  • large vision-and-language models
  • multimodal emotion analysis
