TY - JOUR
T1 - Graph-enhanced visual representations and question-guided dual attention for visual question answering
AU - Yusuf, Abdulganiyu Abdu
AU - Feng, Chong
AU - Mao, Xianling
AU - Haruna, Yunusa
AU - Li, Xinyan
AU - Duma, Ramadhani Ally
N1 - Publisher Copyright:
© 2024 Elsevier B.V.
PY - 2025/1/21
Y1 - 2025/1/21
N2 - Visual Question Answering (VQA) has witnessed significant advancements in recent years, driven by the application of deep learning to vision-language research. Most current VQA models focus on merging visual and textual features, but it is essential for these models to also consider the relationships between different parts of an image and to use question information to highlight important features. This study proposes a method to enhance neighboring image region features and learn question-aware visual representations. First, we construct a region graph to represent spatial relationships between objects in the image. Then, a graph convolutional network (GCN) is used to propagate information across neighboring regions, enriching each region's feature representation with contextual information. To capture long-range dependencies, the graph is enhanced with random walk with restart (RWR), enabling multi-hop reasoning across distant regions. Furthermore, a question-aware dual attention mechanism is introduced to refine region features at both the region and feature levels, ensuring that the model emphasizes the regions most critical for answering the question. The enhanced region representations are then combined with the encoded question to predict an answer. Extensive experiments on VQA benchmarks demonstrate state-of-the-art performance by leveraging regional dependencies and question guidance. The integration of GCNs and random walks helps capture contextual information and focus visual attention selectively, yielding significant improvements over existing methods on the VQA 1.0 and VQA 2.0 benchmark datasets.
AB - Visual Question Answering (VQA) has witnessed significant advancements in recent years, driven by the application of deep learning to vision-language research. Most current VQA models focus on merging visual and textual features, but it is essential for these models to also consider the relationships between different parts of an image and to use question information to highlight important features. This study proposes a method to enhance neighboring image region features and learn question-aware visual representations. First, we construct a region graph to represent spatial relationships between objects in the image. Then, a graph convolutional network (GCN) is used to propagate information across neighboring regions, enriching each region's feature representation with contextual information. To capture long-range dependencies, the graph is enhanced with random walk with restart (RWR), enabling multi-hop reasoning across distant regions. Furthermore, a question-aware dual attention mechanism is introduced to refine region features at both the region and feature levels, ensuring that the model emphasizes the regions most critical for answering the question. The enhanced region representations are then combined with the encoded question to predict an answer. Extensive experiments on VQA benchmarks demonstrate state-of-the-art performance by leveraging regional dependencies and question guidance. The integration of GCNs and random walks helps capture contextual information and focus visual attention selectively, yielding significant improvements over existing methods on the VQA 1.0 and VQA 2.0 benchmark datasets.
KW - Dual attention mechanism
KW - Enhanced feature representations
KW - Graph convolutional networks
KW - Random walk with restart
KW - Visual question answering
UR - http://www.scopus.com/inward/record.url?scp=85208990776&partnerID=8YFLogxK
U2 - 10.1016/j.neucom.2024.128850
DO - 10.1016/j.neucom.2024.128850
M3 - Article
AN - SCOPUS:85208990776
SN - 0925-2312
VL - 614
JO - Neurocomputing
JF - Neurocomputing
M1 - 128850
ER -