TY - JOUR
T1 - KVL-BERT
T2 - Knowledge Enhanced Visual-and-Linguistic BERT for visual commonsense reasoning
AU - Song, Dandan
AU - Ma, Siyi
AU - Sun, Zhanchen
AU - Yang, Sicheng
AU - Liao, Lejian
N1 - Publisher Copyright: © 2021 Elsevier B.V.
PY - 2021/10/27
Y1 - 2021/10/27
N2 - Reasoning is a critical ability for complete visual understanding. To develop machines with cognition-level visual understanding and reasoning abilities, the visual commonsense reasoning (VCR) task has been introduced. In VCR, given a challenging question about an image, a machine must answer correctly and then provide a rationale justifying its answer. Methods that adopt the powerful BERT model as the backbone for learning joint representations of image content and natural language have shown promising improvements on VCR. However, none of the existing methods have utilized commonsense knowledge in visual commonsense reasoning, which we believe will be greatly helpful in this task. Therefore, we incorporate commonsense knowledge into the cross-modal BERT and propose a novel Knowledge Enhanced Visual-and-Linguistic BERT (KVL-BERT for short) model. Besides taking visual and linguistic contents as input, the model integrates external commonsense knowledge into the multi-layer Transformer. To preserve the structural information and semantic representation of the original sentence, we propose an algorithm called RMGSR (Relative-position-embedding and Mask-self-attention Guided Semantic Representations). Our KVL-BERT outperforms other task-specific models and general task-agnostic pre-training models.
AB - Reasoning is a critical ability for complete visual understanding. To develop machines with cognition-level visual understanding and reasoning abilities, the visual commonsense reasoning (VCR) task has been introduced. In VCR, given a challenging question about an image, a machine must answer correctly and then provide a rationale justifying its answer. Methods that adopt the powerful BERT model as the backbone for learning joint representations of image content and natural language have shown promising improvements on VCR. However, none of the existing methods have utilized commonsense knowledge in visual commonsense reasoning, which we believe will be greatly helpful in this task. Therefore, we incorporate commonsense knowledge into the cross-modal BERT and propose a novel Knowledge Enhanced Visual-and-Linguistic BERT (KVL-BERT for short) model. Besides taking visual and linguistic contents as input, the model integrates external commonsense knowledge into the multi-layer Transformer. To preserve the structural information and semantic representation of the original sentence, we propose an algorithm called RMGSR (Relative-position-embedding and Mask-self-attention Guided Semantic Representations). Our KVL-BERT outperforms other task-specific models and general task-agnostic pre-training models.
KW - Commonsense knowledge integration
KW - Multimodal BERT
KW - Visual commonsense reasoning
UR - http://www.scopus.com/inward/record.url?scp=85113381187&partnerID=8YFLogxK
U2 - 10.1016/j.knosys.2021.107408
DO - 10.1016/j.knosys.2021.107408
M3 - Article
AN - SCOPUS:85113381187
SN - 0950-7051
VL - 230
JO - Knowledge-Based Systems
JF - Knowledge-Based Systems
M1 - 107408
ER -