MGHFT: Multi-Granularity Hierarchical Fusion Transformer for Cross-Modal Sticker Emotion Recognition

  • Jian Chen
  • Yuxuan Hu
  • Haifeng Lu
  • Wei Wang
  • Min Yang
  • Chengming Li*
  • Xiping Hu*
  • *Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; peer-review

Abstract

Although visual models pre-trained with text have demonstrated strong capabilities in visual feature extraction, sticker emotion understanding remains challenging because it relies on multi-view information, such as background knowledge and stylistic cues. To address this, we propose a novel multi-granularity hierarchical fusion transformer (MGHFT) with a multi-view sticker interpreter based on Multimodal Large Language Models. Specifically, inspired by the human ability to interpret sticker emotions from multiple views, we first use Multimodal Large Language Models to interpret stickers, producing multi-view descriptions that supply rich textual context. We then design a hierarchical fusion strategy that fuses this textual context into visual understanding; it builds on a pyramid visual transformer that extracts both global and local sticker features at multiple stages. Through contrastive learning and attention mechanisms, textual features are injected at different stages of the visual backbone, so that visual semantics at both global and local granularity are fused under textual guidance. Finally, we introduce a text-guided fusion attention mechanism that integrates the overall multimodal features and enhances semantic understanding. Extensive experiments on two public sticker emotion datasets demonstrate that MGHFT significantly outperforms existing sticker emotion recognition approaches, achieving higher accuracy and more fine-grained emotion recognition. Compared with the best pre-trained visual models, MGHFT also obtains a clear improvement of 5.4% in F1 and 4.0% in accuracy. The code is released at https://github.com/cccccj-03/MGHFT_ACMMM2025.
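The idea of injecting textual features at several stages of a visual backbone can be illustrated with a minimal sketch. This is not the authors' implementation (see the released code for that): it assumes visual and text tokens already share one embedding dimension, omits learned projections, contrastive learning and the pyramid backbone itself, and uses plain single-head cross-attention with a residual connection. The function names `text_guided_attention` and `hierarchical_fusion` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def text_guided_attention(visual_tokens, text_tokens):
    """Single-head cross-attention: each visual token (query) attends over
    the text tokens (keys and values); the attended text context is added
    back to the visual token as a residual.
    Assumption: no learned projections, shared embedding dimension."""
    dim = len(visual_tokens[0])
    scale = math.sqrt(dim)
    fused = []
    for v in visual_tokens:
        weights = softmax([dot(v, t) / scale for t in text_tokens])
        context = [
            sum(w * t[i] for w, t in zip(weights, text_tokens))
            for i in range(dim)
        ]
        fused.append([vi + ci for vi, ci in zip(v, context)])
    return fused


def hierarchical_fusion(stage_features, text_tokens):
    """Apply the same text-guided fusion to every stage of a (hypothetical)
    visual pyramid: early stages hold many local tokens, later stages hold
    fewer, more global tokens."""
    return [text_guided_attention(tokens, text_tokens) for tokens in stage_features]


if __name__ == "__main__":
    # Toy data: two "local" tokens at stage 1, one "global" token at stage 2.
    stages = [[[1.0, 0.0], [0.0, 1.0]], [[0.5, 0.5]]]
    text = [[0.2, 0.8], [0.9, 0.1]]
    for index, tokens in enumerate(hierarchical_fusion(stages, text), start=1):
        print(f"stage {index}: {len(tokens)} fused token(s)")
```

In a real model the per-stage dimensions usually differ, so each stage would need its own projection of the text features; the sketch sidesteps that by fixing a single dimension.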

Original language: English
Title of host publication: MM 2025 - Proceedings of the 33rd ACM International Conference on Multimedia, Co-Located with MM 2025
Publisher: Association for Computing Machinery, Inc
Pages: 5794-5803
Number of pages: 10
ISBN (Electronic): 9798400720352
Publication status: Published - 27 Oct 2025
Externally published: Yes
Event: 33rd ACM International Conference on Multimedia, MM 2025 - Dublin, Ireland
Duration: 27 Oct 2025 to 31 Oct 2025

Publication series

Name: MM 2025 - Proceedings of the 33rd ACM International Conference on Multimedia, Co-Located with MM 2025

Conference

Conference: 33rd ACM International Conference on Multimedia, MM 2025
Country/Territory: Ireland
City: Dublin
Period: 27/10/25 to 31/10/25

Keywords

  • multimodal fusion
  • sticker emotion recognition
