MAPM: multiscale attention pre-training model for TextVQA

Yue Yang, Yue Yu*, Yingying Li

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

The Text Visual Question Answering (TextVQA) task aims to enable models to read and answer questions about images that contain text. Existing attention-based methods for TextVQA often struggle to align local features across modalities during multimodal interaction, and this misalignment limits their accuracy on questions grounded in image text. To address this issue, the Multiscale Attention Pre-training Model (MAPM) is proposed to enhance multimodal feature fusion. MAPM introduces multiscale attention modules that provide fine-grained local feature enhancement and global feature fusion across modalities, enabling better alignment and integration of visual and textual information. Additionally, MAPM is pre-trained with scene text using three pre-training tasks: masked language modeling, visual region matching, and OCR visual text matching. This pre-training establishes effective semantic alignment among the different modalities. Experimental evaluations demonstrate the superiority of MAPM, which achieves 1.2% higher accuracy than state-of-the-art models on the TextVQA dataset, particularly on questions involving numerical data within images.

Graphical abstract: The Multiscale Attention Pre-training Model (MAPM) is proposed to enhance local fine-grained features (Joint Attention Module) and to address redundancy in global features (Global Attention Module) in the TextVQA task. Three pre-training tasks are designed to strengthen the model's expressive power and address cross-modal semantic alignment.
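As a rough illustration of the two attention scales described above, the sketch below pairs a cross-modal (local) attention step with a self-attention (global) fusion step. The class names JointAttention and GlobalAttention, the dimensions, and the overall wiring are assumptions made for this example; they are not taken from the paper's implementation.

```python
# Minimal sketch of a two-scale attention scheme: local cross-modal alignment
# followed by global fusion. All module names and sizes are illustrative
# assumptions, not the authors' actual architecture.
import torch
import torch.nn as nn


class JointAttention(nn.Module):
    """Fine-grained local alignment: text (question/OCR) tokens attend to visual regions."""

    def __init__(self, dim: int = 768, heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_feats: torch.Tensor, visual_feats: torch.Tensor) -> torch.Tensor:
        # Each text token gathers information from the visual-region features.
        attended, _ = self.cross_attn(text_feats, visual_feats, visual_feats)
        return self.norm(text_feats + attended)


class GlobalAttention(nn.Module):
    """Global fusion: self-attention over the concatenated multimodal sequence."""

    def __init__(self, dim: int = 768, heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, fused: torch.Tensor) -> torch.Tensor:
        attended, _ = self.self_attn(fused, fused, fused)
        return self.norm(fused + attended)


if __name__ == "__main__":
    B, T, V, D = 2, 20, 36, 768              # batch, text tokens, visual regions, hidden size
    text = torch.randn(B, T, D)
    visual = torch.randn(B, V, D)
    local = JointAttention(D)(text, visual)                          # local cross-modal step
    fused = GlobalAttention(D)(torch.cat([local, visual], dim=1))    # global fusion step
    print(fused.shape)                        # torch.Size([2, 56, 768])
```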

Original language: English
Pages (from-to): 10401-10413
Number of pages: 13
Journal: Applied Intelligence
Volume: 54
Issue number: 21
DOIs
Publication status: Published - Nov 2024

Keywords

  • Attention mechanisms
  • Cross-modal semantic alignment
  • Pre-training
  • Text visual question answering

