MMAF: Masked Multi-modal Attention Fusion to Reduce Bias of Visual Features for Named Entity Recognition

Jinhui Pang*, Xinyun Yang, Xiaoyao Qiu, Zixuan Wang, Taisheng Huang

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

Multi-modal Named Entity Recognition (MNER) is a vision-language task that uses images as auxiliary information to detect and classify named entities in an input sentence. Recent studies have found that visual information is helpful for Named Entity Recognition (NER), but the difference between the two modalities is not carefully considered. Approaches built on separate pre-trained models do not reduce the gap between textual and visual features, and models that assign the same weight to both modalities often make wrong predictions because of noise in the visual information. To reduce this bias, we propose a Masked Multi-modal Attention Fusion approach for MNER, named MMAF. First, we use image captioning to generate a textual representation of the image, which is combined with the original sentence. Then, to obtain textual and visual features, we map the multi-modal inputs into a shared space and stack Multi-modal Attention Fusion layers that perform full interaction between the two modalities. We add a Multi-modal Attention Mask to highlight the importance of certain words in the sentence, enhancing entity detection. Finally, we obtain a Multi-modal Attention based representation for each word and perform entity labeling via a CRF decoder. Experiments show our method outperforms state-of-the-art models by 0.23% and 0.84% on the Twitter 2015 and 2017 MNER datasets respectively, demonstrating its effectiveness.
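The abstract does not give implementation details, but the fusion step it describes (projecting both modalities into a shared space, cross-attending with an attention mask, and adding a residual connection) can be illustrated with a minimal PyTorch sketch. All class names, dimensions, and the additive-mask scheme below are illustrative assumptions (e.g. BERT-style token features and ResNet-style image region features), not the authors' code:

```python
import torch
import torch.nn as nn

class MultiModalAttentionFusion(nn.Module):
    """One illustrative fusion layer: text queries attend over the
    concatenated text+image keys/values in a shared hidden space,
    with an optional additive attention mask that can up-weight or
    suppress selected positions. A sketch, not the paper's design."""

    def __init__(self, text_dim=768, image_dim=2048, hidden_dim=768, num_heads=8):
        super().__init__()
        # Map both modalities into the shared space.
        self.text_proj = nn.Linear(text_dim, hidden_dim)
        self.image_proj = nn.Linear(image_dim, hidden_dim)
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(hidden_dim)

    def forward(self, text_feats, image_feats, attn_mask=None):
        # text_feats:  (batch, seq_len, text_dim),  e.g. BERT outputs over
        #              the sentence plus the generated image caption
        # image_feats: (batch, regions, image_dim), e.g. CNN region features
        q = self.text_proj(text_feats)
        kv = torch.cat([q, self.image_proj(image_feats)], dim=1)
        # attn_mask: optional float mask of shape (seq_len, seq_len + regions);
        # large negative entries suppress positions, zeros leave them as-is.
        fused, _ = self.attn(q, kv, kv, attn_mask=attn_mask)
        # Residual connection keeps the textual signal dominant, so noisy
        # visual features cannot fully overwrite the word representation.
        return self.norm(q + fused)

# Example: 2 sentences of 16 tokens, 49 image regions per sentence.
layer = MultiModalAttentionFusion()
out = layer(torch.randn(2, 16, 768), torch.randn(2, 49, 2048))
print(out.shape)  # torch.Size([2, 16, 768]) -> per-word fused features
```

The per-word outputs of such stacked layers would then be scored per entity tag and decoded with a CRF, as the abstract describes; a linear-chain CRF layer (for instance from the pytorch-crf package) is the standard choice for that final labeling step.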

Original language: English
Pages (from-to): 1114-1133
Number of pages: 20
Journal: Data Intelligence
Volume: 6
Issue number: 4
DOIs: https://doi.org/10.3724/2096-7004.di.2024.0049
Publication status: Published - 1 Dec 2024

Keywords

  • Attention mask
  • Image caption
  • Multi-head attention
  • Multi-modal named entity recognition
  • Reduce bias of visual features

Cite this

Pang, J., Yang, X., Qiu, X., Wang, Z., & Huang, T. (2024). MMAF: Masked Multi-modal Attention Fusion to Reduce Bias of Visual Features for Named Entity Recognition. Data Intelligence, 6(4), 1114-1133. https://doi.org/10.3724/2096-7004.di.2024.0049