Multimodal emotion recognition via unified granularity contrastive learning and similar negative discrimination

Abstract
Audio-visual emotion recognition plays a crucial role in advancing human-computer interaction by enabling systems to perceive users' emotional states. While recent advances have primarily focused on audio-visual feature fusion and alignment, existing approaches often overlook two critical challenges: (1) aligning audio-visual features across varying levels of granularity, and (2) effectively discriminating hard negative samples, i.e., samples with highly similar feature representations that belong to different emotional categories. To address these limitations, we propose a novel audio-visual emotion recognition framework. First, we introduce a unified granularity contrastive learning strategy that employs a shared vector space to harmonize features of different granularities, enabling more consistent cross-modal alignment. Second, to improve class discrimination, particularly in the presence of hard negative samples, we propose a similar negative discrimination module that uses an auxiliary classification head to explicitly separate semantically similar but class-distinct samples across modalities. Extensive experiments on two widely used benchmark datasets, CREMA-D and IEMOCAP, demonstrate that our method achieves state-of-the-art performance. Our source code is available at https://github.com/gaoweibit/multi-modal_emotion_recognition.
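The abstract only names the two mechanisms, so the sketch below illustrates in PyTorch what such a design could look like: modality features projected into a shared space and aligned with a symmetric InfoNCE contrastive loss, plus an auxiliary classification head whose cross-entropy term pushes apart hard negatives that are close in feature space but carry different emotion labels. Every concrete choice here (the module names, `proj_dim`, simple linear projections, fusion by concatenation) is an assumption made for illustration, not the authors' implementation; see the linked repository for the actual method.

```python
# Minimal, hypothetical sketch of the two ideas described in the abstract.
# Not the paper's implementation; all dimensions and names are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedSpaceProjector(nn.Module):
    """Projects audio and visual features (of any granularity) into one
    shared, L2-normalized vector space so they can be contrasted directly.
    Assumption: a single linear projection per modality."""
    def __init__(self, audio_dim: int, visual_dim: int, proj_dim: int = 256):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, proj_dim)
        self.visual_proj = nn.Linear(visual_dim, proj_dim)

    def forward(self, audio_feats, visual_feats):
        a = F.normalize(self.audio_proj(audio_feats), dim=-1)
        v = F.normalize(self.visual_proj(visual_feats), dim=-1)
        return a, v

def contrastive_alignment_loss(a, v, temperature: float = 0.07):
    """Symmetric InfoNCE over a batch: matched audio-visual pairs are
    positives; every other pairing in the batch serves as a negative."""
    logits = a @ v.t() / temperature              # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

class AuxiliaryDiscriminationHead(nn.Module):
    """Auxiliary emotion classifier on fused features; its supervised
    cross-entropy term separates semantically similar samples from
    different classes (the hard negatives)."""
    def __init__(self, proj_dim: int, num_classes: int):
        super().__init__()
        self.head = nn.Linear(2 * proj_dim, num_classes)

    def forward(self, a, v, labels):
        logits = self.head(torch.cat([a, v], dim=-1))  # naive fusion by concat
        return F.cross_entropy(logits, labels)

# Usage on random stand-in features: batch of 8, 4 emotion classes.
projector = SharedSpaceProjector(audio_dim=128, visual_dim=512)
aux_head = AuxiliaryDiscriminationHead(proj_dim=256, num_classes=4)
audio, visual = torch.randn(8, 128), torch.randn(8, 512)
labels = torch.randint(0, 4, (8,))
a, v = projector(audio, visual)
loss = contrastive_alignment_loss(a, v) + aux_head(a, v, labels)
loss.backward()
```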
| | |
|---|---|
| Original language | English |
| Article number | 113224 |
| Journal | Pattern Recognition |
| Volume | 176 |
| DOIs | |
| Publication status | Published - Aug 2026 |
| Externally published | Yes |
Keywords
- Audio-visual emotion recognition
- Cross-attention
- Hard-negative samples
- Unified granularity tokens