Bilateral Cross-Modality Graph Matching Attention for Feature Fusion in Visual Question Answering

Jianjian Cao; Xiameng Qin; Sanyuan Zhao; Jianbing Shen

doi:10.1109/TNNLS.2021.3135655

School of Computer Science

Research output: Contribution to journal › Article › peer-review

18 Citations (Scopus)

Abstract

Answering semantically complicated questions about an image is challenging in the visual question answering (VQA) task. Although the image can be well represented by deep learning, the question is always embedded in a simple way that does not convey its meaning well. In addition, there is a gap between the visual and textual features because they come from different modalities, which makes it difficult to align and use the cross-modality information. In this article, we focus on these two problems and propose a graph matching attention (GMA) network. First, it builds a graph not only for the image but also for the question, using both syntactic and embedding information. Next, we explore the intramodality relationships with a dual-stage graph encoder and then present a bilateral cross-modality GMA to infer the relationships between the image and the question. The updated cross-modality features are then sent to the answer prediction module for final answer prediction. Experiments demonstrate that our network achieves state-of-the-art performance on the GQA and VQA 2.0 datasets. Ablation studies verify the effectiveness of each module in our GMA network.
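As a rough illustration of the "bilateral" idea in the abstract, the sketch below lets image-graph nodes attend to question-graph nodes and vice versa, then appends the attended context to each node. It is not the authors' GMA network: plain scaled dot-product attention is swapped in for the graph matching step, the graph encoders are omitted, and every name here (`attend`, `bilateral_cross_attention`, the toy node features) is a hypothetical chosen for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def attend(queries, keys):
    """For each query node, return the attention-weighted average of key nodes.

    Assumption: generic scaled dot-product attention, used as a stand-in
    for the paper's graph matching attention.
    """
    dim = len(keys[0])
    scale = math.sqrt(dim)
    contexts = []
    for q in queries:
        weights = softmax([dot(q, k) / scale for k in keys])
        contexts.append(
            [sum(w * k[d] for w, k in zip(weights, keys)) for d in range(dim)]
        )
    return contexts


def bilateral_cross_attention(visual_nodes, question_nodes):
    """Update both modalities with context from the other one.

    Each updated node is the original feature vector concatenated with
    its cross-modality context vector (a hypothetical fusion choice).
    """
    visual_ctx = attend(visual_nodes, question_nodes)    # image -> question
    question_ctx = attend(question_nodes, visual_nodes)  # question -> image
    visual_updated = [v + c for v, c in zip(visual_nodes, visual_ctx)]
    question_updated = [q + c for q, c in zip(question_nodes, question_ctx)]
    return visual_updated, question_updated


# Toy node features (made up): 3 image regions and 2 question words, 2-D each.
visual_nodes = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
question_nodes = [[1.0, 0.0], [0.0, 1.0]]

visual_updated, question_updated = bilateral_cross_attention(
    visual_nodes, question_nodes
)
print(len(visual_updated), len(visual_updated[0]))
print(len(question_updated), len(question_updated[0]))
```

In a full pipeline the updated features of both modalities would be passed on to an answer prediction module, as the abstract describes; that stage is left out here.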

Original language	English
Journal	IEEE Transactions on Neural Networks and Learning Systems
DOI	https://doi.org/10.1109/TNNLS.2021.3135655
Publication status	Accepted/In press - 2022

Access to Document

10.1109/TNNLS.2021.3135655

Other files and links

Link to publication in Scopus

Cite this

@article{3429ee93a93e4903bb0a2a195e2c83f2,

title = "Bilateral Cross-Modality Graph Matching Attention for Feature Fusion in Visual Question Answering",

abstract = "Answering semantically complicated questions according to an image is challenging in a visual question answering (VQA) task. Although the image can be well represented by deep learning, the question is always simply embedded and cannot well indicate its meaning. Besides, the visual and textual features have a gap for different modalities, it is difficult to align and utilize the cross-modality information. In this article, we focus on these two problems and propose a graph matching attention (GMA) network. First, it not only builds graph for the image but also constructs graph for the question in terms of both syntactic and embedding information. Next, we explore the intramodality relationships by a dual-stage graph encoder and then present a bilateral cross-modality GMA to infer the relationships between the image and the question. The updated cross-modality features are then sent into the answer prediction module for final answer prediction. Experiments demonstrate that our network achieves the state-of-the-art performance on the GQA dataset and the VQA 2.0 dataset. The ablation studies verify the effectiveness of each module in our GMA network.",

keywords = "Cognition, Deep learning, Graph matching attention (GMA), Prediction algorithms, Semantics, Syntactics, Task analysis, Visualization, relational reasoning, visual question answering (VQA)",

author = "Jianjian Cao and Xiameng Qin and Sanyuan Zhao and Jianbing Shen",

note = "Publisher Copyright: IEEE",

year = "2022",

doi = "10.1109/TNNLS.2021.3135655",

language = "English",

journal = "IEEE Transactions on Neural Networks and Learning Systems",

issn = "2162-237X",

publisher = "IEEE Computational Intelligence Society",

}

TY - JOUR

T1 - Bilateral Cross-Modality Graph Matching Attention for Feature Fusion in Visual Question Answering

AU - Cao, Jianjian

AU - Qin, Xiameng

AU - Zhao, Sanyuan

AU - Shen, Jianbing

N1 - Publisher Copyright: IEEE

PY - 2022

Y1 - 2022

N2 - Answering semantically complicated questions according to an image is challenging in a visual question answering (VQA) task. Although the image can be well represented by deep learning, the question is always simply embedded and cannot well indicate its meaning. Besides, the visual and textual features have a gap for different modalities, it is difficult to align and utilize the cross-modality information. In this article, we focus on these two problems and propose a graph matching attention (GMA) network. First, it not only builds graph for the image but also constructs graph for the question in terms of both syntactic and embedding information. Next, we explore the intramodality relationships by a dual-stage graph encoder and then present a bilateral cross-modality GMA to infer the relationships between the image and the question. The updated cross-modality features are then sent into the answer prediction module for final answer prediction. Experiments demonstrate that our network achieves the state-of-the-art performance on the GQA dataset and the VQA 2.0 dataset. The ablation studies verify the effectiveness of each module in our GMA network.

AB - Answering semantically complicated questions according to an image is challenging in a visual question answering (VQA) task. Although the image can be well represented by deep learning, the question is always simply embedded and cannot well indicate its meaning. Besides, the visual and textual features have a gap for different modalities, it is difficult to align and utilize the cross-modality information. In this article, we focus on these two problems and propose a graph matching attention (GMA) network. First, it not only builds graph for the image but also constructs graph for the question in terms of both syntactic and embedding information. Next, we explore the intramodality relationships by a dual-stage graph encoder and then present a bilateral cross-modality GMA to infer the relationships between the image and the question. The updated cross-modality features are then sent into the answer prediction module for final answer prediction. Experiments demonstrate that our network achieves the state-of-the-art performance on the GQA dataset and the VQA 2.0 dataset. The ablation studies verify the effectiveness of each module in our GMA network.

KW - Cognition

KW - Deep learning

KW - Graph matching attention (GMA)

KW - Prediction algorithms

KW - Semantics

KW - Syntactics

KW - Task analysis

KW - Visualization

KW - relational reasoning

KW - visual question answering (VQA)

UR - http://www.scopus.com/inward/record.url?scp=85124748370&partnerID=8YFLogxK

U2 - 10.1109/TNNLS.2021.3135655

DO - 10.1109/TNNLS.2021.3135655

M3 - Article

AN - SCOPUS:85124748370

SN - 2162-237X

JO - IEEE Transactions on Neural Networks and Learning Systems

JF - IEEE Transactions on Neural Networks and Learning Systems

ER -
