Graph neural networks for visual question answering: a systematic review

Abdulganiyu Abdu Yusuf; Chong Feng; Xianling Mao; Ramadhani Ally Duma; Mohammed Salah Abood; Abdulrahman Hamman Adama Chukkol

doi:10.1007/s11042-023-17594-x

Abdulganiyu Abdu Yusuf, Chong Feng^*, Xianling Mao, Ramadhani Ally Duma, Mohammed Salah Abood, Abdulrahman Hamman Adama Chukkol

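^*Corresponding author for this work

School of Computer Science

Research output: Contribution to journal › Article › Peer-reviewed

1 citation (Scopus)

Abstract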

Recently, visual question answering (VQA) has gained considerable interest within the computer vision and natural language processing (NLP) research areas. The VQA task involves answering a question about an image, which requires both language and vision understanding. Effectively extracting visual representations from images, extracting textual embeddings from questions, and bridging the semantic disparity between image and question representations pose fundamental challenges in VQA. Lately, an increasing number of studies have focused on using graph neural networks (GNNs) to improve performance on VQA tasks. A major advantage of GNNs for VQA is their ability to handle graph-structured data, which allows better representation of the spatial and semantic relationships between objects and regions in an image. This paper systematically reviews graph neural network-based studies for image-based VQA. Fifty-four related publications written between 2018 and January 2023 were synthesized for this review. The review is structured around three perspectives: the graph neural network techniques and models that have been applied to VQA, a comparison of the models' performance, and existing challenges. Analysis of these papers identified 45 different models, grouped into four GNN techniques: Graph Convolution Network (GCN), Graph Attention Network (GAT), Graph Isomorphism Network (GIN) and Graph Neural Network (GNN). The performance of these models is compared based on accuracy, datasets, subtasks, feature representation and fusion techniques. Lastly, the study offers suggestions for mitigating the remaining challenges in future research on visual question answering.

Original language	English
Journal	Multimedia Tools and Applications
DOI	https://doi.org/10.1007/s11042-023-17594-x
Publication status	Accepted/In press - 2023

Cite this

@article{d4ec1db627cb42618eb2caf52564c641,

title = "Graph neural networks for visual question answering: a systematic review",

abstract = "Recently, visual question answering (VQA) has gained considerable interest within the computer vision and natural language processing (NLP) research areas. The VQA task involves answering a question about an image, which requires both language and vision understanding. Effectively extracting visual representations from images, textual embedding from questions, and bridging the semantic disparity between image and question representations pose fundamental challenges in VQA. Lately, an increasing number of studies are focusing on utilizing graph neural networks (GNNs) to enhance the performance of VQA tasks. The ability to handle graph-structured data is a major advantage of GNNs for VQA tasks, which allows better representation of relationships between objects and regions in an image. These relationships include both spatial and semantic relationships. This paper systematically reviews various graph neural networks based studies for image-based VQA. Fifty-four related publications written between 2018—Jan. 2023 were carefully synthesized for this review. The review is structured into three perspectives: the various graph neural network techniques and models that have been applied for VQA, a comparison of the model's performance and existing challenges. After analyzing these papers, 45 different models were identified, grouped into four different GNN techniques. These are Graph Convolution Network (GCN), Graph Attention Network (GAT), Graph Isomorphism Network (GIN) and Graph Neural Network (GNN). Also, the performance of these models is compared based on accuracy, datasets, subtasks, feature representation and fusion techniques. Lastly, the study provided some possible suggestions to mitigate still existing challenges for future research in visual question answering.",

keywords = "Computer vision, Graph neural networks, Natural language processing, Visual question answering",

author = "Yusuf, {Abdulganiyu Abdu} and Chong Feng and Xianling Mao and {Ally Duma}, Ramadhani and Abood, {Mohammed Salah} and Chukkol, {Abdulrahman Hamman Adama}",

note = "Publisher Copyright: {\textcopyright} 2023, The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature.",

year = "2023",

doi = "10.1007/s11042-023-17594-x",

language = "English",

journal = "Multimedia Tools and Applications",

issn = "1380-7501",

publisher = "Springer Netherlands",

}

TY - JOUR

T1 - Graph neural networks for visual question answering

T2 - a systematic review

AU - Yusuf, Abdulganiyu Abdu

AU - Feng, Chong

AU - Mao, Xianling

AU - Ally Duma, Ramadhani

AU - Abood, Mohammed Salah

AU - Chukkol, Abdulrahman Hamman Adama

PY - 2023

Y1 - 2023

N2 - Recently, visual question answering (VQA) has gained considerable interest within the computer vision and natural language processing (NLP) research areas. The VQA task involves answering a question about an image, which requires both language and vision understanding. Effectively extracting visual representations from images, textual embedding from questions, and bridging the semantic disparity between image and question representations pose fundamental challenges in VQA. Lately, an increasing number of studies are focusing on utilizing graph neural networks (GNNs) to enhance the performance of VQA tasks. The ability to handle graph-structured data is a major advantage of GNNs for VQA tasks, which allows better representation of relationships between objects and regions in an image. These relationships include both spatial and semantic relationships. This paper systematically reviews various graph neural networks based studies for image-based VQA. Fifty-four related publications written between 2018—Jan. 2023 were carefully synthesized for this review. The review is structured into three perspectives: the various graph neural network techniques and models that have been applied for VQA, a comparison of the model's performance and existing challenges. After analyzing these papers, 45 different models were identified, grouped into four different GNN techniques. These are Graph Convolution Network (GCN), Graph Attention Network (GAT), Graph Isomorphism Network (GIN) and Graph Neural Network (GNN). Also, the performance of these models is compared based on accuracy, datasets, subtasks, feature representation and fusion techniques. Lastly, the study provided some possible suggestions to mitigate still existing challenges for future research in visual question answering.

AB - Recently, visual question answering (VQA) has gained considerable interest within the computer vision and natural language processing (NLP) research areas. The VQA task involves answering a question about an image, which requires both language and vision understanding. Effectively extracting visual representations from images, textual embedding from questions, and bridging the semantic disparity between image and question representations pose fundamental challenges in VQA. Lately, an increasing number of studies are focusing on utilizing graph neural networks (GNNs) to enhance the performance of VQA tasks. The ability to handle graph-structured data is a major advantage of GNNs for VQA tasks, which allows better representation of relationships between objects and regions in an image. These relationships include both spatial and semantic relationships. This paper systematically reviews various graph neural networks based studies for image-based VQA. Fifty-four related publications written between 2018—Jan. 2023 were carefully synthesized for this review. The review is structured into three perspectives: the various graph neural network techniques and models that have been applied for VQA, a comparison of the model's performance and existing challenges. After analyzing these papers, 45 different models were identified, grouped into four different GNN techniques. These are Graph Convolution Network (GCN), Graph Attention Network (GAT), Graph Isomorphism Network (GIN) and Graph Neural Network (GNN). Also, the performance of these models is compared based on accuracy, datasets, subtasks, feature representation and fusion techniques. Lastly, the study provided some possible suggestions to mitigate still existing challenges for future research in visual question answering.

KW - Computer vision

KW - Graph neural networks

KW - Natural language processing

KW - Visual question answering

UR - http://www.scopus.com/inward/record.url?scp=85176765505&partnerID=8YFLogxK

U2 - 10.1007/s11042-023-17594-x

DO - 10.1007/s11042-023-17594-x

M3 - Article

AN - SCOPUS:85176765505

SN - 1380-7501

JO - Multimedia Tools and Applications

JF - Multimedia Tools and Applications

ER -
