MMAF: Masked Multi-modal Attention Fusion to Reduce Bias of Visual Features for Named Entity Recognition

Jinhui Pang*, Xinyun Yang, Xiaoyao Qiu, Zixuan Wang, Taisheng Huang

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

Multi-modal Named Entity Recognition (MNER) is a vision-language task that uses images as auxiliary information to detect and classify named entities in an input sentence. Recent studies have found that visual information is helpful for Named Entity Recognition (NER), but the difference between the two modalities is not carefully considered. Approaches built on separate pre-trained models do not reduce the gap between textual and visual features, and models that assign the same weight to both modalities often make wrong predictions because of noise in the visual information. To reduce this bias, we propose a Masked Multi-modal Attention Fusion approach for MNER, named MMAF. First, we use image captioning to generate a textual representation of the image, which is combined with the original sentence. Then, to obtain textual and visual features, we map the multi-modal inputs into a shared space and stack Multi-modal Attention Fusion layers that perform full interaction between the two modalities. We add a Multi-modal Attention Mask to highlight the importance of certain words in the sentence, enhancing entity detection. Finally, we obtain a Multi-modal Attention based representation for each word and perform entity labeling via a CRF decoder. Experiments show our method outperforms state-of-the-art models by 0.23% and 0.84% on the Twitter 2015 and 2017 MNER datasets respectively, demonstrating its effectiveness.
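The abstract does not give implementation details, but the fusion step it describes (projecting both modalities into a shared space, cross-attending with an attention mask, and adding a residual connection) can be illustrated with a minimal PyTorch sketch. All class names, dimensions, and the additive-mask scheme below are illustrative assumptions (e.g. BERT-style token features and ResNet-style image region features), not the authors' code:

```python
import torch
import torch.nn as nn

class MultiModalAttentionFusion(nn.Module):
    """One illustrative fusion layer: text queries attend over the
    concatenated text+image keys/values in a shared hidden space,
    with an optional additive attention mask that can up-weight or
    suppress selected positions. A sketch, not the paper's design."""

    def __init__(self, text_dim=768, image_dim=2048, hidden_dim=768, num_heads=8):
        super().__init__()
        # Map both modalities into the shared space.
        self.text_proj = nn.Linear(text_dim, hidden_dim)
        self.image_proj = nn.Linear(image_dim, hidden_dim)
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(hidden_dim)

    def forward(self, text_feats, image_feats, attn_mask=None):
        # text_feats:  (batch, seq_len, text_dim),  e.g. BERT outputs over
        #              the sentence plus the generated image caption
        # image_feats: (batch, regions, image_dim), e.g. CNN region features
        q = self.text_proj(text_feats)
        kv = torch.cat([q, self.image_proj(image_feats)], dim=1)
        # attn_mask: optional float mask of shape (seq_len, seq_len + regions);
        # large negative entries suppress positions, zeros leave them as-is.
        fused, _ = self.attn(q, kv, kv, attn_mask=attn_mask)
        # Residual connection keeps the textual signal dominant, so noisy
        # visual features cannot fully overwrite the word representation.
        return self.norm(q + fused)

# Example: 2 sentences of 16 tokens, 49 image regions per sentence.
layer = MultiModalAttentionFusion()
out = layer(torch.randn(2, 16, 768), torch.randn(2, 49, 2048))
print(out.shape)  # torch.Size([2, 16, 768]) -> per-word fused features
```

The per-word outputs of such stacked layers would then be scored per entity tag and decoded with a CRF, as the abstract describes; a linear-chain CRF layer (for instance from the pytorch-crf package) is the standard choice for that final labeling step.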

Original language: English
Pages (from-to): 1114-1133
Number of pages: 20
Journal: Data Intelligence
Volume: 6
Issue number: 4
DOIs: https://doi.org/10.3724/2096-7004.di.2024.0049
Publication status: Published - 1 Dec 2024

Keywords

  • Attention mask
  • Image caption
  • Multi-head attention
  • Multi-modal named entity recognition
  • Reduce bias of visual features

Cite this

Pang, J., Yang, X., Qiu, X., Wang, Z., & Huang, T. (2024). MMAF: Masked Multi-modal Attention Fusion to Reduce Bias of Visual Features for Named Entity Recognition. Data Intelligence, 6(4), 1114-1133. https://doi.org/10.3724/2096-7004.di.2024.0049