TY - JOUR
T1 - Depth-Aware and Semantic Guided Relational Attention Network for Visual Question Answering
AU - Liu, Yuhang
AU - Wei, Wei
AU - Peng, Daowan
AU - Mao, Xian-Ling
AU - He, Zhiyong
AU - Zhou, Pan
N1 - Publisher Copyright:
© 2022 IEEE.
PY - 2023
Y1 - 2023
N2 - Visual relationship understanding plays an indispensable role in grounded language tasks such as visual question answering (VQA), which often requires precise reasoning about the relations among objects referred to in a given question. However, prior works generally suffer from two deficiencies: (1) spatial-relation inference ambiguity, since it is difficult to accurately estimate the distance between a pair of visual objects in 2D space when their 2D bounding boxes overlap; and (2) missing language-visual relational alignment, since a high-quality answer cannot be generated when the language and visual relations of objects are not aligned during fusion, even with a powerful fusion model such as the Transformer. To this end, we first model the spatial relation of a pair of objects in 3D space by augmenting the original 2D bounding box with 1D depth information, and then propose a novel model, the Depth-aware Semantic Guided Relational Attention Network (DSGANet), which explicitly exploits the resulting 3D spatial relations of objects in an intra-/inter-modality manner for precise relational alignment. Extensive experiments on the VQA v2.0 and GQA benchmarks demonstrate that DSGANet achieves competitive performance compared with both pretrained and non-pretrained models, e.g., 72.7% vs. 74.6% based on learned grid features on VQA v2.0.
AB - Visual relationship understanding plays an indispensable role in grounded language tasks such as visual question answering (VQA), which often requires precise reasoning about the relations among objects referred to in a given question. However, prior works generally suffer from two deficiencies: (1) spatial-relation inference ambiguity, since it is difficult to accurately estimate the distance between a pair of visual objects in 2D space when their 2D bounding boxes overlap; and (2) missing language-visual relational alignment, since a high-quality answer cannot be generated when the language and visual relations of objects are not aligned during fusion, even with a powerful fusion model such as the Transformer. To this end, we first model the spatial relation of a pair of objects in 3D space by augmenting the original 2D bounding box with 1D depth information, and then propose a novel model, the Depth-aware Semantic Guided Relational Attention Network (DSGANet), which explicitly exploits the resulting 3D spatial relations of objects in an intra-/inter-modality manner for precise relational alignment. Extensive experiments on the VQA v2.0 and GQA benchmarks demonstrate that DSGANet achieves competitive performance compared with both pretrained and non-pretrained models, e.g., 72.7% vs. 74.6% based on learned grid features on VQA v2.0.
KW - Depth estimation
KW - multi-modal representation
KW - relational reasoning
KW - visual question answering
UR - http://www.scopus.com/inward/record.url?scp=85135236711&partnerID=8YFLogxK
U2 - 10.1109/TMM.2022.3190686
DO - 10.1109/TMM.2022.3190686
M3 - Article
AN - SCOPUS:85135236711
SN - 1520-9210
VL - 25
SP - 5344
EP - 5357
JO - IEEE Transactions on Multimedia
JF - IEEE Transactions on Multimedia
ER -