TY - GEN
T1 - Overcoming Language Priors in VQA via Decomposed Linguistic Representations
AU - Jing, Chenchen
AU - Wu, Yuwei
AU - Zhang, Xiaoxun
AU - Jia, Yunde
AU - Wu, Qi
N1 - Publisher Copyright:
Copyright © 2020, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
PY - 2020
Y1 - 2020
N2 - Most existing Visual Question Answering (VQA) models overly rely on language priors between questions and answers. In this paper, we present a novel method of language attention-based VQA that learns decomposed linguistic representations of questions and utilizes the representations to infer answers for overcoming language priors. We introduce a modular language attention mechanism to parse a question into three phrase representations: type representation, object representation, and concept representation. We use the type representation to identify the question type and the possible answer set (yes/no or specific concepts such as colors or numbers), and the object representation to focus on the relevant region of an image. The concept representation is verified with the attended region to infer the final answer. The proposed method decouples the language-based concept discovery and vision-based concept verification in the process of answer inference to prevent language priors from dominating the answering process. Experiments on the VQA-CP dataset demonstrate the effectiveness of our method.
UR - http://www.scopus.com/inward/record.url?scp=85095296406&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85095296406
T3 - AAAI 2020 - 34th AAAI Conference on Artificial Intelligence
SP - 11181
EP - 11188
BT - AAAI 2020 - 34th AAAI Conference on Artificial Intelligence
PB - AAAI Press
T2 - 34th AAAI Conference on Artificial Intelligence, AAAI 2020
Y2 - 7 February 2020 through 12 February 2020
ER -