TY - GEN
T1 - Visual-Guided Reasoning Path Generation for Visual Question Answering
AU - Liu, Xinyu
AU - Jing, Chenchen
AU - Zhai, Mingliang
AU - Wu, Yuwei
AU - Jia, Yunde
N1 - Publisher Copyright:
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2025.
PY - 2025
Y1 - 2025
AB - Neural module network (NMN) based methods have shown promising performance in visual question answering (VQA). However, existing methods have overlooked the potential existence of multiple reasoning paths for a given question. They generate one reasoning path per question, which restricts the diversity of module combinations. Additionally, these methods generate reasoning paths solely based on questions, neglecting visual cues, which may lead to sub-optimal paths in multi-step reasoning scenarios. In this paper, we introduce the Visual-Guided Neural Module Network (V-NMN), a neuro-symbolic method that integrates visual information to enhance the model’s reasoning capabilities. Specifically, we utilize the reasoning capability of large language models (LLMs) to generate all feasible reasoning paths for a question in a few-shot manner. Then, we assess the suitability of these paths for the image and select the optimal one based on the assessment. The final answer is derived by executing the reasoning process along the selected path. We evaluate our method on the GQA dataset and on CX-GQA, a test set that requires multi-step reasoning. Experimental results demonstrate its effectiveness in real-world scenarios.
KW - Large language model
KW - Neural module network
KW - Visual question answering
UR - http://www.scopus.com/inward/record.url?scp=85209590936&partnerID=8YFLogxK
DO - 10.1007/978-981-97-8487-5_12
M3 - Conference contribution
AN - SCOPUS:85209590936
SN - 9789819784868
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 167
EP - 180
BT - Pattern Recognition and Computer Vision - 7th Chinese Conference, PRCV 2024, Proceedings
A2 - Lin, Zhouchen
A2 - Zha, Hongbin
A2 - Cheng, Ming-Ming
A2 - He, Ran
A2 - Liu, Cheng-Lin
A2 - Ubul, Kurban
A2 - Silamu, Wushouer
A2 - Zhou, Jie
PB - Springer Science and Business Media Deutschland GmbH
T2 - 7th Chinese Conference on Pattern Recognition and Computer Vision, PRCV 2024
Y2 - 18 October 2024 through 20 October 2024
ER -