TY - JOUR
T1 - MMAF
T2 - Masked Multi-modal Attention Fusion to Reduce Bias of Visual Features for Named Entity Recognition
AU - Pang, Jinhui
AU - Yang, Xinyun
AU - Qiu, Xiaoyao
AU - Wang, Zixuan
AU - Huang, Taisheng
N1 - Publisher Copyright:
© 2024 Chinese Academy of Sciences.
PY - 2024/12/1
Y1 - 2024/12/1
N2 - Multi-modal Named Entity Recognition (MNER) is a vision-language task that uses images as auxiliary input to detect and classify named entities in an input sentence. Recent studies find that visual information is helpful for Named Entity Recognition (NER), yet the difference between the two modalities is not carefully considered. Approaches that rely on different pre-trained models do not reduce the gap between textual and visual features, and models that assign equal weight to the two modalities often make wrong predictions because of noise in the visual information. To reduce this bias, we propose a Masked Multi-modal Attention Fusion approach for MNER, named MMAF. First, we use image captioning to generate a textual representation of the image, which is combined with the original sentence. Then, to obtain textual and visual features, we map the multi-modal inputs into a shared space and stack Multi-modal Attention Fusion layers that perform full interaction between the two modalities. We add a Multi-modal Attention Mask to highlight the importance of certain words in the sentence, enhancing the performance of entity detection. Finally, we obtain a Multi-modal Attention based representation for each word and perform entity labeling via a CRF decoder. Experiments show our method outperforms state-of-the-art models by 0.23% and 0.84% on the Twitter 2015 and 2017 MNER datasets respectively, demonstrating its effectiveness.
AB - Multi-modal Named Entity Recognition (MNER) is a vision-language task that uses images as auxiliary input to detect and classify named entities in an input sentence. Recent studies find that visual information is helpful for Named Entity Recognition (NER), yet the difference between the two modalities is not carefully considered. Approaches that rely on different pre-trained models do not reduce the gap between textual and visual features, and models that assign equal weight to the two modalities often make wrong predictions because of noise in the visual information. To reduce this bias, we propose a Masked Multi-modal Attention Fusion approach for MNER, named MMAF. First, we use image captioning to generate a textual representation of the image, which is combined with the original sentence. Then, to obtain textual and visual features, we map the multi-modal inputs into a shared space and stack Multi-modal Attention Fusion layers that perform full interaction between the two modalities. We add a Multi-modal Attention Mask to highlight the importance of certain words in the sentence, enhancing the performance of entity detection. Finally, we obtain a Multi-modal Attention based representation for each word and perform entity labeling via a CRF decoder. Experiments show our method outperforms state-of-the-art models by 0.23% and 0.84% on the Twitter 2015 and 2017 MNER datasets respectively, demonstrating its effectiveness.
KW - Attention mask
KW - Image caption
KW - Multi-head attention
KW - Multi-modal named entity recognition
KW - Reduce bias of visual features
UR - http://www.scopus.com/inward/record.url?scp=85218440732&partnerID=8YFLogxK
U2 - 10.3724/2096-7004.di.2024.0049
DO - 10.3724/2096-7004.di.2024.0049
M3 - Article
AN - SCOPUS:85218440732
SN - 2096-7004
VL - 6
SP - 1114
EP - 1133
JO - Data Intelligence
JF - Data Intelligence
IS - 4
ER -