
Multimodal emotion recognition via unified granularity contrastive learning and similar negative discrimination

  • Yongwei Li
  • Wei Gao
  • Jianwu Li*

* Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

Audio-visual emotion recognition plays a crucial role in advancing human-computer interaction by enabling systems to perceive users’ emotional states. While recent advances have primarily focused on audio-visual feature fusion and alignment, existing approaches often overlook two critical challenges: (1) aligning audio-visual features across varying levels of granularity, and (2) discriminating hard negative samples, which have highly similar feature representations but belong to different emotional categories. To address these limitations, we propose a novel audio-visual emotion recognition framework. First, we introduce a unified granularity contrastive learning strategy, which employs a shared vector space to harmonize features of different granularities, thereby enabling more consistent cross-modal alignment. Second, to improve class discrimination, particularly in the presence of hard negative samples, we propose a similar negative discrimination module that uses an auxiliary classification head to explicitly separate semantically similar but class-distinct samples across modalities. Extensive experiments on two widely used benchmark datasets, CREMA-D and IEMOCAP, demonstrate that our method achieves state-of-the-art performance. Our source code is available at https://github.com/gaoweibit/multi-modal_emotion_recognition.
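The two ideas named in the abstract, cross-modal contrastive alignment in a shared space and the identification of hard negatives, can be illustrated with a generic sketch. This is not the authors' implementation (their code is in the repository linked above). It assumes a standard symmetric InfoNCE loss over cosine similarities and a simple "most similar sample with a different label" rule for hard negatives. All function names, the temperature value, and the toy embeddings are hypothetical, and plain Python is used only to keep the example self-contained.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def cosine_matrix(audio, visual):
    """sim[i][j] = cosine similarity between audio[i] and visual[j]."""
    a = [l2_normalize(v) for v in audio]
    b = [l2_normalize(v) for v in visual]
    return [[sum(x * y for x, y in zip(ai, bj)) for bj in b] for ai in a]


def _cross_entropy(logits, target):
    """Numerically stable -log softmax(logits)[target]."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target]


def symmetric_info_nce(audio, visual, temperature=0.1):
    """Assumed generic loss: audio[i] and visual[i] form the positive pair,
    every other pairing in the batch is treated as a negative."""
    sim = cosine_matrix(audio, visual)
    n = len(sim)
    # Audio-to-visual direction: each row of sim is one softmax problem.
    a2v = sum(_cross_entropy([s / temperature for s in sim[i]], i)
              for i in range(n))
    # Visual-to-audio direction: each column of sim is one softmax problem.
    v2a = sum(_cross_entropy([sim[j][i] / temperature for j in range(n)], i)
              for i in range(n))
    return (a2v + v2a) / (2 * n)


def hardest_negatives(audio, visual, labels):
    """For each audio sample, return the index of the most similar visual
    sample that carries a different emotion label (None if there is none)."""
    sim = cosine_matrix(audio, visual)
    result = []
    for i, row in enumerate(sim):
        candidates = [j for j in range(len(row)) if labels[j] != labels[i]]
        result.append(max(candidates, key=lambda j: row[j])
                      if candidates else None)
    return result


if __name__ == "__main__":
    # Toy 2-D embeddings, invented purely for illustration.
    audio = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
    visual = [[1.0, 0.1], [0.1, 1.0], [1.0, 0.0]]
    labels = ["angry", "sad", "happy"]
    print(symmetric_info_nce(audio, visual))
    print(hardest_negatives(audio, visual, labels))
```

In a setup like the one the abstract describes, pairs selected this way would be the inputs to an auxiliary classification head trained to tell them apart. How the authors select and weight such pairs is not stated in the abstract.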

Original language: English
Article number: 113224
Journal: Pattern Recognition
Volume: 176
DOIs
Publication status: Published - Aug 2026
Externally published: Yes

Keywords

  • Audio-visual emotion recognition
  • Cross-attention
  • Hard-negative samples
  • Unified granularity tokens
