KVL-BERT: Knowledge Enhanced Visual-and-Linguistic BERT for visual commonsense reasoning

Dandan Song; Siyi Ma; Zhanchen Sun; Sicheng Yang; Lejian Liao

doi:10.1016/j.knosys.2021.107408

Corresponding author: Dandan Song

School of Computer Science

Beijing Institute of Technology

Research output: Contribution to journal › Article › Peer-reviewed

28 citations (Scopus)

Abstract

Reasoning is a critical ability for complete visual understanding. To develop machines with cognition-level visual understanding and reasoning abilities, the visual commonsense reasoning (VCR) task has been introduced. In VCR, given a challenging question about an image, a machine must answer correctly and then provide a rationale justifying its answer. Methods that adopt the powerful BERT model as the backbone for learning joint representations of image content and natural language have shown promising improvements on VCR. However, none of the existing methods have utilized commonsense knowledge in visual commonsense reasoning, which we believe will be greatly helpful in this task. Therefore, we incorporate commonsense knowledge into the cross-modal BERT and propose a novel Knowledge Enhanced Visual-and-Linguistic BERT (KVL-BERT for short) model. Besides taking visual and linguistic contents as input, the model integrates external commonsense knowledge into the multi-layer Transformer. In order to preserve the structural information and semantic representation of the original sentence, we propose an algorithm called RMGSR (Relative-position-embedding and Mask-self-attention Guided Semantic Representations). Our KVL-BERT outperforms other task-specific models and general task-agnostic pre-training models.
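
The abstract names the two ingredients of RMGSR but does not give the algorithm itself. As a rough illustration only, and not the authors' implementation, the Python sketch below shows one common way such ingredients can be combined: injected knowledge tokens take positions relative to the token they attach to, and a visibility mask keeps them from influencing unrelated sentence tokens. The function names, the example tokens, and the exact visibility rule are all assumptions made for this sketch.

```python
import math


def inject_knowledge(tokens, knowledge):
    """Hypothetical helper: splice knowledge tokens in after their anchor token.

    tokens:    original sentence tokens
    knowledge: dict {index of anchor token in `tokens`: [knowledge tokens]}
    Returns (sequence, positions, visible).
    """
    seq, pos, anchor, is_know = [], [], [], []
    for i, tok in enumerate(tokens):
        # Original tokens keep the position they had in the plain sentence.
        seq.append(tok)
        pos.append(i)
        anchor.append(i)
        is_know.append(False)
        # Knowledge tokens count their position onward from the anchor.
        for k, ktok in enumerate(knowledge.get(i, []), start=1):
            seq.append(ktok)
            pos.append(i + k)
            anchor.append(i)
            is_know.append(True)
    n = len(seq)
    # Assumed rule: sentence tokens see each other; a knowledge token is
    # visible only within its own branch (its anchor and sibling tokens).
    visible = [
        [(not is_know[a] and not is_know[b]) or anchor[a] == anchor[b]
         for b in range(n)]
        for a in range(n)
    ]
    return seq, pos, visible


def masked_attention(scores, visible):
    """Row-wise softmax over `scores`, with invisible pairs forced to zero."""
    weights = []
    for row, vis in zip(scores, visible):
        masked = [s if v else -math.inf for s, v in zip(row, vis)]
        top = max(masked)  # finite, since every token can see itself
        exps = [math.exp(s - top) for s in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


seq, pos, visible = inject_knowledge(
    ["umbrella", "is", "open"], {0: ["used_for", "rain"]}
)
scores = [[0.0] * len(seq) for _ in seq]  # uniform scores for illustration
weights = masked_attention(scores, visible)
print(seq)  # ['umbrella', 'used_for', 'rain', 'is', 'open']
print(pos)  # [0, 1, 2, 1, 2]
```

In this toy run the sentence tokens keep positions 0, 1, 2 even though two knowledge tokens sit between them, and the row of `weights` for "is" assigns zero weight to "used_for" and "rain", so the injected knowledge reaches the rest of the sentence only through its anchor token.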

Original language	English
Article number	107408
Journal	Knowledge-Based Systems
Volume	230
DOI	https://doi.org/10.1016/j.knosys.2021.107408
Publication status	Published - 27 Oct 2021

Cite this

@article{ef8d7aad3d3d41aab6c58dd5794f9351,

title = "KVL-BERT: Knowledge Enhanced Visual-and-Linguistic BERT for visual commonsense reasoning",

abstract = "Reasoning is a critical ability for complete visual understanding. To develop machines with cognition-level visual understanding and reasoning abilities, the visual commonsense reasoning (VCR) task has been introduced. In VCR, given a challenging question about an image, a machine must answer correctly and then provide a rationale justifying its answer. Methods that adopt the powerful BERT model as the backbone for learning joint representations of image content and natural language have shown promising improvements on VCR. However, none of the existing methods have utilized commonsense knowledge in visual commonsense reasoning, which we believe will be greatly helpful in this task. Therefore, we incorporate commonsense knowledge into the cross-modal BERT and propose a novel Knowledge Enhanced Visual-and-Linguistic BERT (KVL-BERT for short) model. Besides taking visual and linguistic contents as input, the model integrates external commonsense knowledge into the multi-layer Transformer. In order to preserve the structural information and semantic representation of the original sentence, we propose an algorithm called RMGSR (Relative-position-embedding and Mask-self-attention Guided Semantic Representations). Our KVL-BERT outperforms other task-specific models and general task-agnostic pre-training models.",

keywords = "Commonsense knowledge integration, Multimodal BERT, Visual commonsense reasoning",

author = "Dandan Song and Siyi Ma and Zhanchen Sun and Sicheng Yang and Lejian Liao",

note = "Publisher Copyright: {\textcopyright} 2021 Elsevier B.V.",

year = "2021",

month = oct,

day = "27",

doi = "10.1016/j.knosys.2021.107408",

language = "English",

volume = "230",

journal = "Knowledge-Based Systems",

issn = "0950-7051",

publisher = "Elsevier B.V.",

}

TY - JOUR

T1 - KVL-BERT

T2 - Knowledge Enhanced Visual-and-Linguistic BERT for visual commonsense reasoning

AU - Song, Dandan

AU - Ma, Siyi

AU - Sun, Zhanchen

AU - Yang, Sicheng

AU - Liao, Lejian

PY - 2021/10/27

Y1 - 2021/10/27

N2 - Reasoning is a critical ability for complete visual understanding. To develop machines with cognition-level visual understanding and reasoning abilities, the visual commonsense reasoning (VCR) task has been introduced. In VCR, given a challenging question about an image, a machine must answer correctly and then provide a rationale justifying its answer. Methods that adopt the powerful BERT model as the backbone for learning joint representations of image content and natural language have shown promising improvements on VCR. However, none of the existing methods have utilized commonsense knowledge in visual commonsense reasoning, which we believe will be greatly helpful in this task. Therefore, we incorporate commonsense knowledge into the cross-modal BERT and propose a novel Knowledge Enhanced Visual-and-Linguistic BERT (KVL-BERT for short) model. Besides taking visual and linguistic contents as input, the model integrates external commonsense knowledge into the multi-layer Transformer. In order to preserve the structural information and semantic representation of the original sentence, we propose an algorithm called RMGSR (Relative-position-embedding and Mask-self-attention Guided Semantic Representations). Our KVL-BERT outperforms other task-specific models and general task-agnostic pre-training models.

KW - Commonsense knowledge integration

KW - Multimodal BERT

KW - Visual commonsense reasoning

UR - http://www.scopus.com/inward/record.url?scp=85113381187&partnerID=8YFLogxK

U2 - 10.1016/j.knosys.2021.107408

DO - 10.1016/j.knosys.2021.107408

M3 - Article

AN - SCOPUS:85113381187

SN - 0950-7051

VL - 230

JO - Knowledge-Based Systems

JF - Knowledge-Based Systems

M1 - 107408

ER -
