TY - GEN
T1 - From Subtle Hints to Grand Expressions - Mastering Fine-grained Emotions with Dynamic Multimodal Analysis
AU - Xu, Qinfu
AU - Pan, Liyuan
AU - Yuan, Shaozu
AU - Wei, Yiwei
AU - Wu, Chunlei
N1 - Publisher Copyright:
© 2025 ACM.
PY - 2025/10/27
Y1 - 2025/10/27
N2 - Multimodal Emotion Analysis (MEA) plays a crucial role in extracting and understanding emotional insights from diverse data sources, including text, video, and audio. However, existing methods may overlook the key issues that multimodal components are temporally asynchronous and that fine-grained emotional expressions are insufficiently represented. In light of this, we propose a unified emotion reasoning model, EmoChat, which enhances multimodal emotion analysis by dynamically generating emotion-related tokens and fine-grained expression information through facial action modeling. To incorporate expression semantics, we design the AU Agent, a lightweight facial expression extractor, to provide LLMs with fine-grained facial knowledge for reasoning. In addition, we propose the Correlation Aggregator to alleviate the correlation differences between acoustic features and textual content. As a result, our method decouples the audio and vision modalities, enabling efficient token-level mining of emotion cues from misaligned multimodal input while maintaining semantic consistency across different languages. Experiments on public benchmark datasets demonstrate the superiority of the proposed EmoChat over state-of-the-art methods.
AB - Multimodal Emotion Analysis (MEA) plays a crucial role in extracting and understanding emotional insights from diverse data sources, including text, video, and audio. However, existing methods may overlook the key issues that multimodal components are temporally asynchronous and that fine-grained emotional expressions are insufficiently represented. In light of this, we propose a unified emotion reasoning model, EmoChat, which enhances multimodal emotion analysis by dynamically generating emotion-related tokens and fine-grained expression information through facial action modeling. To incorporate expression semantics, we design the AU Agent, a lightweight facial expression extractor, to provide LLMs with fine-grained facial knowledge for reasoning. In addition, we propose the Correlation Aggregator to alleviate the correlation differences between acoustic features and textual content. As a result, our method decouples the audio and vision modalities, enabling efficient token-level mining of emotion cues from misaligned multimodal input while maintaining semantic consistency across different languages. Experiments on public benchmark datasets demonstrate the superiority of the proposed EmoChat over state-of-the-art methods.
KW - large vision-and-language models
KW - multimodal emotion analysis
UR - https://www.scopus.com/pages/publications/105024070797
U2 - 10.1145/3746027.3754762
DO - 10.1145/3746027.3754762
M3 - Conference contribution
AN - SCOPUS:105024070797
T3 - MM 2025 - Proceedings of the 33rd ACM International Conference on Multimedia, Co-Located with MM 2025
SP - 5499
EP - 5508
BT - MM 2025 - Proceedings of the 33rd ACM International Conference on Multimedia, Co-Located with MM 2025
PB - Association for Computing Machinery, Inc
T2 - 33rd ACM International Conference on Multimedia, MM 2025
Y2 - 27 October 2025 through 31 October 2025
ER -