MAPM: multiscale attention pre-training model for TextVQA

Yue Yang, Yue Yu*, Yingying Li

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

The Text Visual Question Answering (TextVQA) task aims to enable models to read and answer questions about images that contain text. Existing attention-based methods for TextVQA often struggle to align local features across modalities during multimodal information interaction, and this misalignment limits their accuracy in answering questions about text-bearing images. To address this issue, the Multiscale Attention Pre-training Model (MAPM) is proposed to enhance multimodal feature fusion. MAPM introduces multiscale attention modules that provide fine-grained local feature enhancement and global feature fusion across modalities; with these modules, MAPM aligns and integrates visual and textual information more effectively. In addition, MAPM is pre-trained on scene text with three pre-training tasks: masked language modeling, visual region matching, and OCR visual-text matching. This pre-training establishes effective semantic alignment among the different modalities. Experimental evaluations demonstrate the superiority of MAPM, which achieves 1.2% higher accuracy than state-of-the-art models on the TextVQA dataset, especially when handling numerical data within images.

Graphical abstract: The Multiscale Attention Pre-training Model (MAPM) is proposed to enhance local fine-grained features (Joint Attention Module) and to reduce redundancy in global features (Global Attention Module) in the TextVQA task. Three pre-training tasks are designed to strengthen the model's expressive power and to address cross-modal semantic alignment.
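The abstract describes the architecture only at a high level. Purely as an illustration of the general idea, below is a minimal, hypothetical PyTorch sketch of a two-scale cross-modal attention block: a joint (local) attention step in which text/OCR tokens attend to visual region features, followed by a global attention step over pooled modality summaries. All class, parameter, and tensor names are assumptions for illustration and do not reflect the authors' implementation.

# Hypothetical sketch of the multiscale (joint + global) attention idea; not the authors' code.
import torch
import torch.nn as nn

class MultiscaleAttentionFusion(nn.Module):
    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        # Local, fine-grained cross-modal attention: text/OCR tokens query visual region features.
        self.joint_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Global attention over pooled summaries of both modalities.
        self.global_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_feats: torch.Tensor, visual_feats: torch.Tensor) -> torch.Tensor:
        # text_feats: (B, Lt, dim), visual_feats: (B, Lv, dim)
        # Fine-grained alignment: each text token attends to the visual/OCR tokens.
        local, _ = self.joint_attn(text_feats, visual_feats, visual_feats)
        local = self.norm(text_feats + local)
        # Global fusion: pooled summaries of both modalities attend to each other.
        pooled = torch.stack([local.mean(dim=1), visual_feats.mean(dim=1)], dim=1)  # (B, 2, dim)
        fused, _ = self.global_attn(pooled, pooled, pooled)
        return fused.mean(dim=1)  # (B, dim) joint representation

# Usage sketch with random features standing in for question/OCR embeddings and region features.
if __name__ == "__main__":
    model = MultiscaleAttentionFusion()
    t = torch.randn(2, 20, 768)
    v = torch.randn(2, 36, 768)
    print(model(t, v).shape)  # torch.Size([2, 768])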

Original language: English
Pages (from-to): 10401-10413
Number of pages: 13
Journal: Applied Intelligence
Volume: 54
Issue number: 21
DOI
Publication status: Published - Nov 2024
