TY - JOUR
T1 - MAPM: multiscale attention pre-training model for TextVQA
AU - Yang, Yue
AU - Yu, Yue
AU - Li, Yingying
N1 - Publisher Copyright:
© The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2024.
PY - 2024/11
Y1 - 2024/11
AB - The Text Visual Question Answering (TextVQA) task aims to enable models to read and answer questions about images that contain text. Existing attention-based methods for TextVQA often struggle to align local features between modalities during multimodal information interaction, and this misalignment hinders their ability to answer questions about text-bearing images accurately. To address this issue, the Multiscale Attention Pre-training Model (MAPM) is proposed to enhance multimodal feature fusion. MAPM introduces multiscale attention modules that facilitate fine-grained local feature enhancement and global feature fusion across modalities. By adopting these modules, MAPM achieves superior performance in aligning and integrating visual and textual information. Additionally, MAPM is pre-trained with scene text using three pre-training tasks: masked language modeling, visual region matching, and OCR visual text matching. This pre-training establishes effective semantic alignment among the different modalities. Experimental evaluations demonstrate the superiority of MAPM, which achieves 1.2% higher accuracy than state-of-the-art models on the TextVQA dataset, especially when handling numerical data within images. Graphical abstract: The Multiscale Attention Pre-training Model (MAPM) is proposed to enhance local fine-grained features (Joint Attention Module) and to address redundancy in global features (Global Attention Module) in the TextVQA task. Three pre-training tasks are designed to enhance the model's expressive power and address cross-modal semantic alignment.
KW - Attention mechanisms
KW - Cross-modal semantic alignment
KW - Pre-training
KW - Text visual question answering
UR - http://www.scopus.com/inward/record.url?scp=85201297621&partnerID=8YFLogxK
U2 - 10.1007/s10489-024-05727-0
DO - 10.1007/s10489-024-05727-0
M3 - Article
AN - SCOPUS:85201297621
SN - 0924-669X
VL - 54
SP - 10401
EP - 10413
JO - Applied Intelligence
JF - Applied Intelligence
IS - 21
ER -