TY - JOUR
T1 - Multi-scale dual-stream visual feature extraction and graph reasoning for visual question answering
AU - Yusuf, Abdulganiyu Abdu
AU - Feng, Chong
AU - Mao, Xianling
AU - Li, Xinyan
AU - Haruna, Yunusa
AU - Duma, Ramadhani Ally
N1 - Publisher Copyright:
© The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2025.
PY - 2025/4
Y1 - 2025/4
N2 - Recent advancements in deep learning algorithms have significantly expanded the capabilities of systems to handle vision-to-language (V2L) tasks. Visual question answering (VQA) presents challenges that require a deep understanding of visual and language content to perform complex reasoning. Existing VQA models often rely on grid-based or region-based visual features, which capture global context and object-specific details, respectively. However, balancing the complementary strengths of each feature type while minimizing fusion noise remains a significant challenge. This study proposes a multi-scale dual-stream visual feature extraction method that combines grid and region features to enhance both global and local visual representations. In addition, a visual graph relational reasoning (VGRR) approach is proposed that constructs a graph modeling spatial and semantic relationships between visual objects and applies Graph Attention Networks (GATs) for relational reasoning. To strengthen the interaction between the visual and textual modalities, we further propose a cross-modal self-attention fusion strategy that enables the model to focus selectively on the most relevant parts of both the image and the question. The proposed model is evaluated on the VQA 2.0 and GQA benchmark datasets, demonstrating competitive performance with significant accuracy improvements over state-of-the-art methods. Ablation studies confirm the effectiveness of each module in enhancing visual-textual understanding and answer prediction.
AB - Recent advancements in deep learning algorithms have significantly expanded the capabilities of systems to handle vision-to-language (V2L) tasks. Visual question answering (VQA) presents challenges that require a deep understanding of visual and language content to perform complex reasoning. Existing VQA models often rely on grid-based or region-based visual features, which capture global context and object-specific details, respectively. However, balancing the complementary strengths of each feature type while minimizing fusion noise remains a significant challenge. This study proposes a multi-scale dual-stream visual feature extraction method that combines grid and region features to enhance both global and local visual representations. In addition, a visual graph relational reasoning (VGRR) approach is proposed that constructs a graph modeling spatial and semantic relationships between visual objects and applies Graph Attention Networks (GATs) for relational reasoning. To strengthen the interaction between the visual and textual modalities, we further propose a cross-modal self-attention fusion strategy that enables the model to focus selectively on the most relevant parts of both the image and the question. The proposed model is evaluated on the VQA 2.0 and GQA benchmark datasets, demonstrating competitive performance with significant accuracy improvements over state-of-the-art methods. Ablation studies confirm the effectiveness of each module in enhancing visual-textual understanding and answer prediction.
KW - Attention mechanisms
KW - Dual-stream features
KW - Visual graph reasoning
KW - Visual question answering
KW - Visual semantics
UR - http://www.scopus.com/inward/record.url?scp=105000218884&partnerID=8YFLogxK
U2 - 10.1007/s10489-025-06325-4
DO - 10.1007/s10489-025-06325-4
M3 - Article
AN - SCOPUS:105000218884
SN - 0924-669X
VL - 55
JO - Applied Intelligence
JF - Applied Intelligence
IS - 6
M1 - 544
ER -