TY - JOUR
T1 - Depth-Aware and Semantic Guided Relational Attention Network for Visual Question Answering
AU - Liu, Yuhang
AU - Wei, Wei
AU - Peng, Daowan
AU - Mao, Xian-Ling
AU - He, Zhiyong
AU - Zhou, Pan
N1 - Publisher Copyright:
© 2022 IEEE.
PY - 2023
Y1 - 2023
N2 - Visual relationship understanding plays an indispensable role in grounded language tasks such as visual question answering (VQA), which often requires precise reasoning about the relations among objects referred to in a given question. However, prior works generally suffer from two deficiencies: (1) spatial-relation inference ambiguity, since it is difficult to accurately estimate the distance between a pair of visual objects in 2D space when their 2D bounding boxes overlap; and (2) missing language-visual relational alignment, since a high-quality answer cannot be generated when the language and visual relations of objects are not aligned during fusion, even with a powerful fusion model such as the Transformer. To this end, we first model the spatial relation of a pair of objects in 3D space by augmenting the original 2D bounding box with 1D depth information, and then propose a novel model, the Depth-aware Semantic Guided Relational Attention Network (DSGANet), which explicitly exploits the resulting 3D spatial relations of objects in an intra-/inter-modality manner for precise relational alignment. Extensive experiments on the VQA v2.0 and GQA benchmarks demonstrate that DSGANet achieves competitive performance compared with both pretrained and non-pretrained models, e.g., 72.7% vs. 74.6% based on learned grid features on VQA v2.0.
AB - Visual relationship understanding plays an indispensable role in grounded language tasks such as visual question answering (VQA), which often requires precise reasoning about the relations among objects referred to in a given question. However, prior works generally suffer from two deficiencies: (1) spatial-relation inference ambiguity, since it is difficult to accurately estimate the distance between a pair of visual objects in 2D space when their 2D bounding boxes overlap; and (2) missing language-visual relational alignment, since a high-quality answer cannot be generated when the language and visual relations of objects are not aligned during fusion, even with a powerful fusion model such as the Transformer. To this end, we first model the spatial relation of a pair of objects in 3D space by augmenting the original 2D bounding box with 1D depth information, and then propose a novel model, the Depth-aware Semantic Guided Relational Attention Network (DSGANet), which explicitly exploits the resulting 3D spatial relations of objects in an intra-/inter-modality manner for precise relational alignment. Extensive experiments on the VQA v2.0 and GQA benchmarks demonstrate that DSGANet achieves competitive performance compared with both pretrained and non-pretrained models, e.g., 72.7% vs. 74.6% based on learned grid features on VQA v2.0.
KW - Depth estimation
KW - multi-modal representation
KW - relational reasoning
KW - visual question answering
UR - http://www.scopus.com/inward/record.url?scp=85135236711&partnerID=8YFLogxK
U2 - 10.1109/TMM.2022.3190686
DO - 10.1109/TMM.2022.3190686
M3 - Article
AN - SCOPUS:85135236711
SN - 1520-9210
VL - 25
SP - 5344
EP - 5357
JO - IEEE Transactions on Multimedia
JF - IEEE Transactions on Multimedia
ER -