MAPM: multiscale attention pre-training model for TextVQA

Yue Yang, Yue Yu*, Yingying Li

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

The Text Visual Question Answering (TextVQA) task aims to enable models to read and answer questions about images that contain text. Existing attention-based methods for TextVQA often struggle to align local features across modalities during multimodal interaction, and this misalignment limits their accuracy on questions grounded in image text. To address this issue, the Multiscale Attention Pre-training Model (MAPM) is proposed to enhance multimodal feature fusion. MAPM introduces multiscale attention modules that provide fine-grained local feature enhancement and global feature fusion across modalities, enabling better alignment and integration of visual and textual information. Additionally, MAPM is pre-trained with scene text using three pre-training tasks: masked language modeling, visual region matching, and OCR visual text matching. This pre-training establishes effective semantic alignment among the different modalities. Experimental evaluations demonstrate the superiority of MAPM, which achieves 1.2% higher accuracy than state-of-the-art models on the TextVQA dataset, particularly on questions involving numerical data within images.

Graphical abstract: The Multiscale Attention Pre-training Model (MAPM) is proposed to enhance local fine-grained features (Joint Attention Module) and to address redundancy in global features (Global Attention Module) in the TextVQA task. Three pre-training tasks are designed to strengthen the model's expressive power and address cross-modal semantic alignment.
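As a rough illustration of the two attention scales described above, the sketch below pairs a cross-modal (local) attention step with a self-attention (global) fusion step. The class names JointAttention and GlobalAttention, the dimensions, and the overall wiring are assumptions made for this example; they are not taken from the paper's implementation.

```python
# Minimal sketch of a two-scale attention scheme: local cross-modal alignment
# followed by global fusion. All module names and sizes are illustrative
# assumptions, not the authors' actual architecture.
import torch
import torch.nn as nn


class JointAttention(nn.Module):
    """Fine-grained local alignment: text (question/OCR) tokens attend to visual regions."""

    def __init__(self, dim: int = 768, heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_feats: torch.Tensor, visual_feats: torch.Tensor) -> torch.Tensor:
        # Each text token gathers information from the visual-region features.
        attended, _ = self.cross_attn(text_feats, visual_feats, visual_feats)
        return self.norm(text_feats + attended)


class GlobalAttention(nn.Module):
    """Global fusion: self-attention over the concatenated multimodal sequence."""

    def __init__(self, dim: int = 768, heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, fused: torch.Tensor) -> torch.Tensor:
        attended, _ = self.self_attn(fused, fused, fused)
        return self.norm(fused + attended)


if __name__ == "__main__":
    B, T, V, D = 2, 20, 36, 768              # batch, text tokens, visual regions, hidden size
    text = torch.randn(B, T, D)
    visual = torch.randn(B, V, D)
    local = JointAttention(D)(text, visual)                          # local cross-modal step
    fused = GlobalAttention(D)(torch.cat([local, visual], dim=1))    # global fusion step
    print(fused.shape)                        # torch.Size([2, 56, 768])
```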

Original language: English
Pages (from-to): 10401-10413
Number of pages: 13
Journal: Applied Intelligence
Volume: 54
Issue number: 21
DOIs
Publication status: Published - Nov 2024

Keywords

  • Attention mechanisms
  • Cross-modal semantic alignment
  • Pre-training
  • Text visual question answering

