Bilateral Cross-Modality Graph Matching Attention for Feature Fusion in Visual Question Answering

Jianjian Cao; Xiameng Qin; Sanyuan Zhao; Jianbing Shen

doi:10.1109/TNNLS.2021.3135655

Jianjian Cao, Xiameng Qin, Sanyuan Zhao^*, Jianbing Shen

^*Corresponding author for this work

School of Computer Science

Research output: Contribution to journal › Article › peer-review

Abstract

Answering semantically complicated questions according to an image is challenging in a visual question answering (VQA) task. Although the image can be well represented by deep learning, the question is always simply embedded and cannot well indicate its meaning. Besides, the visual and textual features have a gap for different modalities, it is difficult to align and utilize the cross-modality information. In this article, we focus on these two problems and propose a graph matching attention (GMA) network. First, it not only builds graph for the image but also constructs graph for the question in terms of both syntactic and embedding information. Next, we explore the intramodality relationships by a dual-stage graph encoder and then present a bilateral cross-modality GMA to infer the relationships between the image and the question. The updated cross-modality features are then sent into the answer prediction module for final answer prediction. Experiments demonstrate that our network achieves the state-of-the-art performance on the GQA dataset and the VQA 2.0 dataset. The ablation studies verify the effectiveness of each module in our GMA network.

Original language	English
Pages (from-to)	4160-4171
Number of pages	12
Journal	IEEE Transactions on Neural Networks and Learning Systems
Volume	36
Issue number	3
DOI	https://doi.org/10.1109/TNNLS.2021.3135655
Publication status	Published - 2025

Access to Document

10.1109/TNNLS.2021.3135655

Other files and links

Link to publication in Scopus

Cite this

Cao, J., Qin, X., Zhao, S., & Shen, J. (2025). Bilateral Cross-Modality Graph Matching Attention for Feature Fusion in Visual Question Answering. IEEE Transactions on Neural Networks and Learning Systems, 36(3), 4160-4171. https://doi.org/10.1109/TNNLS.2021.3135655

@article{919573bba3e3446eaab6dce939906b40,

title = "Bilateral Cross-Modality Graph Matching Attention for Feature Fusion in Visual Question Answering",

abstract = "Answering semantically complicated questions according to an image is challenging in a visual question answering (VQA) task. Although the image can be well represented by deep learning, the question is always simply embedded and cannot well indicate its meaning. Besides, the visual and textual features have a gap for different modalities, it is difficult to align and utilize the cross-modality information. In this article, we focus on these two problems and propose a graph matching attention (GMA) network. First, it not only builds graph for the image but also constructs graph for the question in terms of both syntactic and embedding information. Next, we explore the intramodality relationships by a dual-stage graph encoder and then present a bilateral cross-modality GMA to infer the relationships between the image and the question. The updated cross-modality features are then sent into the answer prediction module for final answer prediction. Experiments demonstrate that our network achieves the state-of-the-art performance on the GQA dataset and the VQA 2.0 dataset. The ablation studies verify the effectiveness of each module in our GMA network.",

keywords = "Graph matching attention (GMA), relational reasoning, visual question answering (VQA)",

author = "Jianjian Cao and Xiameng Qin and Sanyuan Zhao and Jianbing Shen",

note = "Publisher Copyright: {\textcopyright} 2022 IEEE.",

year = "2025",

doi = "10.1109/TNNLS.2021.3135655",

language = "English",

volume = "36",

pages = "4160--4171",

journal = "IEEE Transactions on Neural Networks and Learning Systems",

issn = "2162-237X",

publisher = "IEEE Computational Intelligence Society",

number = "3",

}

TY - JOUR

T1 - Bilateral Cross-Modality Graph Matching Attention for Feature Fusion in Visual Question Answering

AU - Cao, Jianjian

AU - Qin, Xiameng

AU - Zhao, Sanyuan

AU - Shen, Jianbing

PY - 2025

Y1 - 2025

N2 - Answering semantically complicated questions according to an image is challenging in a visual question answering (VQA) task. Although the image can be well represented by deep learning, the question is always simply embedded and cannot well indicate its meaning. Besides, the visual and textual features have a gap for different modalities, it is difficult to align and utilize the cross-modality information. In this article, we focus on these two problems and propose a graph matching attention (GMA) network. First, it not only builds graph for the image but also constructs graph for the question in terms of both syntactic and embedding information. Next, we explore the intramodality relationships by a dual-stage graph encoder and then present a bilateral cross-modality GMA to infer the relationships between the image and the question. The updated cross-modality features are then sent into the answer prediction module for final answer prediction. Experiments demonstrate that our network achieves the state-of-the-art performance on the GQA dataset and the VQA 2.0 dataset. The ablation studies verify the effectiveness of each module in our GMA network.

AB - Answering semantically complicated questions according to an image is challenging in a visual question answering (VQA) task. Although the image can be well represented by deep learning, the question is always simply embedded and cannot well indicate its meaning. Besides, the visual and textual features have a gap for different modalities, it is difficult to align and utilize the cross-modality information. In this article, we focus on these two problems and propose a graph matching attention (GMA) network. First, it not only builds graph for the image but also constructs graph for the question in terms of both syntactic and embedding information. Next, we explore the intramodality relationships by a dual-stage graph encoder and then present a bilateral cross-modality GMA to infer the relationships between the image and the question. The updated cross-modality features are then sent into the answer prediction module for final answer prediction. Experiments demonstrate that our network achieves the state-of-the-art performance on the GQA dataset and the VQA 2.0 dataset. The ablation studies verify the effectiveness of each module in our GMA network.

KW - Graph matching attention (GMA)

KW - relational reasoning

KW - visual question answering (VQA)

UR - http://www.scopus.com/inward/record.url?scp=86000426641&partnerID=8YFLogxK

U2 - 10.1109/TNNLS.2021.3135655

DO - 10.1109/TNNLS.2021.3135655

M3 - Article

AN - SCOPUS:86000426641

SN - 2162-237X

VL - 36

SP - 4160

EP - 4171

JO - IEEE Transactions on Neural Networks and Learning Systems

JF - IEEE Transactions on Neural Networks and Learning Systems

IS - 3

ER -
