Reinforcement Stacked Learning with Semantic-Associated Attention for Visual Question Answering

Xinyu Xiao, Chunxia Zhang, Shiming Xiang, Chunhong Pan

Research output: Contribution to journal > Conference article > peer-review

Abstract

The task of visual question answering (VQA) is to generate an answer to a question posed about the content of an image. In this process, the critical problems of effectively embedding the question and image features, and of transforming those features into an answer prediction, remain largely unresolved. In this paper, to address these problems, a semantic-associated attention method and a reinforcement stacked learning mechanism are proposed. First, guided by the associations among high-level semantics, a visual spatial attention model (VSA) and a multi-semantic attention model (MSA) are proposed to extract the low-level image feature and the high-level semantic feature, respectively. Furthermore, we develop a reinforcement stacked learning architecture that splits the transformation process into multiple stages in order to gradually approach the answer. At each stage, a new reinforcement learning (RL) method is introduced that directly criticizes inappropriate answers to optimize the model. Extensive experiments on the VQA task show that our method achieves state-of-the-art performance.
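The abstract gives no implementation details, so the sketch below is only an informal illustration of the general idea of a stacked, multi-stage VQA model with spatial attention over image regions. All module names, dimensions, and the fusion scheme are assumptions made for illustration; they do not reproduce the paper's VSA/MSA models or its RL training procedure.

# Illustrative sketch only: a stacked answer-refinement model with spatial
# attention over image regions. Names, dimensions, and the fusion scheme are
# assumptions; they do not reproduce the paper's VSA/MSA or RL training.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SpatialAttention(nn.Module):
    """Attend over K image-region features conditioned on a query vector."""

    def __init__(self, region_dim, query_dim, hidden_dim=512):
        super().__init__()
        self.proj_region = nn.Linear(region_dim, hidden_dim)
        self.proj_query = nn.Linear(query_dim, hidden_dim)
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, regions, query):
        # regions: (B, K, region_dim), query: (B, query_dim)
        h = torch.tanh(self.proj_region(regions) + self.proj_query(query).unsqueeze(1))
        alpha = F.softmax(self.score(h), dim=1)   # (B, K, 1) attention weights
        return (alpha * regions).sum(dim=1)       # (B, region_dim) attended feature


class StackedVQA(nn.Module):
    """Multi-stage ("stacked") refinement of the fused question/image state."""

    def __init__(self, region_dim=2048, q_dim=1024, num_answers=3000, num_stages=3):
        super().__init__()
        self.stages = nn.ModuleList(
            SpatialAttention(region_dim, q_dim) for _ in range(num_stages)
        )
        self.fuse = nn.ModuleList(
            nn.Linear(region_dim + q_dim, q_dim) for _ in range(num_stages)
        )
        self.classifier = nn.Linear(q_dim, num_answers)

    def forward(self, regions, question):
        # regions: (B, K, region_dim); question: (B, q_dim) encoded question
        state = question
        for attend, fuse in zip(self.stages, self.fuse):
            visual = attend(regions, state)                                  # stage-specific attention
            state = torch.tanh(fuse(torch.cat([visual, state], dim=-1)))    # refine the state
        return self.classifier(state)                                        # answer logits


# Tiny smoke test with random tensors.
model = StackedVQA()
logits = model(torch.randn(2, 36, 2048), torch.randn(2, 1024))
print(logits.shape)  # torch.Size([2, 3000])

The stage-wise RL criticism described in the abstract is not shown here; in principle a policy-gradient penalty on incorrect answers could be applied to each stage's output, but the paper's actual reward design is not specified in this record.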

Original language: English
Pages (from-to): 4170-4174
Number of pages: 5
Journal: Proceedings - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing
Volume: 2021-June
DOIs
Publication status: Published - 2021
Event: 2021 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2021 - Virtual, Toronto, Canada
Duration: 6 Jun 2021 - 11 Jun 2021

Keywords

  • Attention
  • Deep learning
  • LSTM
  • Reinforcement learning
  • VQA

