Reinforcement Stacked Learning with Semantic-Associated Attention for Visual Question Answering

Xinyu Xiao; Chunxia Zhang; Shiming Xiang; Chunhong Pan

doi:10.1109/ICASSP39728.2021.9414636

School of Computer Science and Technology

Research output: Contribution to journal › Conference article › peer-review

Abstract

The task of visual question answering (VQA) is to generate an answer to a question about the content of a given image. Two critical problems in this process remain unresolved: effectively embedding the question and image features, and transforming those features into an answer prediction. To address these problems, this paper proposes a semantic-associated attention method and a reinforcement stacked learning mechanism. First, based on the associations of high-level semantics, a visual spatial attention model (VSA) and a multi-semantic attention model (MSA) are proposed to extract the low-level image feature and the high-level semantic feature, respectively. Furthermore, we develop a reinforcement stacked learning architecture, which splits the transformation process into multiple stages to gradually approach the answers. At each stage, a new reinforcement learning (RL) method is introduced to directly criticize inappropriate answers and thereby optimize the model. Extensive experiments on the VQA task show that our method achieves state-of-the-art performance.

Original language	English
Pages (from-to)	4170-4174
Number of pages	5
Journal	Proceedings - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing
Volume	2021-June
DOIs	https://doi.org/10.1109/ICASSP39728.2021.9414636
Publication status	Published - 2021
Event	2021 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2021 - Virtual, Toronto, Canada; Duration: 6 Jun 2021 → 11 Jun 2021

Keywords

Attention
Deep learning
LSTM
Reinforcement learning
VQA

Cite this

@article{cc32ff53c0e948a08b28113f0aacb0a6,

title = "Reinforcement Stacked Learning with Semantic-Associated Attention for Visual Question Answering",

abstract = "The task of visual question answering (VQA) is to generate an answer for a question according to the content of an image being asked. In this process, the critical problems of effectively embedding the question feature and image feature as well as transforming the features to the prediction of answer are still faithfully unresolved. In this paper, depending on these problems, a semantic-associated attention method and a reinforcement stacked learning mechanism are proposed. Firstly, within the associations of high-level semantics, a visual spatial attention model (VSA) and a multi-semantic attention model (MSA) are proposed to extract the low-level image feature and high-level semantic feature, respectively. Furthermore, we develop a reinforcement stacked learning architecture, which splits the transformation process into multiple stages, to gradually approach the answers. At each stage, a new reinforcement learning (RL) method is introduced to directly criticize inappropriate answers to optimize the model. The extensive experiments on the VQA task show that our method can achieve state-of-the-art performance.",

keywords = "Attention, Deep learning, LSTM, Reinforcement learning, VQA",

author = "Xinyu Xiao and Chunxia Zhang and Shiming Xiang and Chunhong Pan",

note = "Publisher Copyright: {\textcopyright}2021 IEEE.; 2021 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2021 ; Conference date: 06-06-2021 Through 11-06-2021",

year = "2021",

doi = "10.1109/ICASSP39728.2021.9414636",

language = "English",

volume = "2021-June",

pages = "4170--4174",

journal = "Proceedings - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing",

issn = "0736-7791",

publisher = "Institute of Electrical and Electronics Engineers Inc.",

}

TY - JOUR

T1 - Reinforcement Stacked Learning with Semantic-Associated Attention for Visual Question Answering

AU - Xiao, Xinyu

AU - Zhang, Chunxia

AU - Xiang, Shiming

AU - Pan, Chunhong

PY - 2021

Y1 - 2021

N2 - The task of visual question answering (VQA) is to generate an answer for a question according to the content of an image being asked. In this process, the critical problems of effectively embedding the question feature and image feature as well as transforming the features to the prediction of answer are still faithfully unresolved. In this paper, depending on these problems, a semantic-associated attention method and a reinforcement stacked learning mechanism are proposed. Firstly, within the associations of high-level semantics, a visual spatial attention model (VSA) and a multi-semantic attention model (MSA) are proposed to extract the low-level image feature and high-level semantic feature, respectively. Furthermore, we develop a reinforcement stacked learning architecture, which splits the transformation process into multiple stages, to gradually approach the answers. At each stage, a new reinforcement learning (RL) method is introduced to directly criticize inappropriate answers to optimize the model. The extensive experiments on the VQA task show that our method can achieve state-of-the-art performance.

AB - The task of visual question answering (VQA) is to generate an answer for a question according to the content of an image being asked. In this process, the critical problems of effectively embedding the question feature and image feature as well as transforming the features to the prediction of answer are still faithfully unresolved. In this paper, depending on these problems, a semantic-associated attention method and a reinforcement stacked learning mechanism are proposed. Firstly, within the associations of high-level semantics, a visual spatial attention model (VSA) and a multi-semantic attention model (MSA) are proposed to extract the low-level image feature and high-level semantic feature, respectively. Furthermore, we develop a reinforcement stacked learning architecture, which splits the transformation process into multiple stages, to gradually approach the answers. At each stage, a new reinforcement learning (RL) method is introduced to directly criticize inappropriate answers to optimize the model. The extensive experiments on the VQA task show that our method can achieve state-of-the-art performance.

KW - Attention

KW - Deep learning

KW - LSTM

KW - Reinforcement learning

KW - VQA

UR - http://www.scopus.com/inward/record.url?scp=85115192659&partnerID=8YFLogxK

U2 - 10.1109/ICASSP39728.2021.9414636

DO - 10.1109/ICASSP39728.2021.9414636

M3 - Conference article

AN - SCOPUS:85115192659

SN - 0736-7791

VL - 2021-June

SP - 4170

EP - 4174

JO - Proceedings - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing

JF - Proceedings - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing

T2 - 2021 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2021

Y2 - 6 June 2021 through 11 June 2021

ER -
