Visual-Guided Reasoning Path Generation for Visual Question Answering

Xinyu Liu*, Chenchen Jing, Mingliang Zhai, Yuwei Wu, Yunde Jia

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Neural module network (NMN) based methods have shown promising performance in visual question answering (VQA). However, existing methods overlook the possibility that a given question admits multiple reasoning paths. They generate a single reasoning path per question, which restricts the diversity of module combinations. Additionally, these methods generate reasoning paths solely from the question and neglect visual cues, which may lead to sub-optimal paths in multi-step reasoning scenarios. In this paper, we introduce the Visual-Guided Neural Module Network (V-NMN), a neuro-symbolic method that integrates visual information to enhance the model’s reasoning capabilities. Specifically, we use the reasoning capability of large language models (LLMs) to generate all feasible reasoning paths for a question in a few-shot manner. Then, we assess the suitability of these paths for the image and select the optimal one based on the assessment. The final answer is derived by executing the reasoning process along the selected path. We evaluate our method on the GQA dataset and on CX-GQA, a test set that requires multi-step reasoning. Experimental results demonstrate its effectiveness in real-world scenarios.
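
The generate, assess, select, and execute steps described in the abstract can be illustrated with a minimal toy sketch. This is not the authors' implementation: the names `propose_paths`, `score_path`, `select_path`, and `execute`, the module vocabulary (`find`, `relate`, `query`), the dictionary-based scene, and the scoring rule are all hypothetical stand-ins. In particular, the hard-coded candidate list replaces the few-shot LLM call, and the presence-based score replaces whatever visual assessment V-NMN actually uses.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical representation: a step is (module_name, argument),
# and a reasoning path is a sequence of steps.
Step = Tuple[str, str]
Path = Tuple[Step, ...]
Scene = List[Dict[str, str]]  # toy stand-in for an image: objects with attributes


def propose_paths(question: str) -> List[Path]:
    """Stand-in for the few-shot LLM call: returns hard-coded candidate paths."""
    return [
        (("find", "cup"), ("query", "color")),
        (("find", "table"), ("relate", "on"), ("query", "color")),
    ]


def score_path(path: Path, scene: Scene) -> float:
    """Toy suitability score: fraction of 'find' targets present in the scene."""
    names = {obj["name"] for obj in scene}
    targets = [arg for module, arg in path if module == "find"]
    if not targets:
        return 0.0
    return sum(target in names for target in targets) / len(targets)


def select_path(paths: List[Path], scene: Scene) -> Path:
    """Pick the highest-scoring candidate (first one wins on ties)."""
    if not paths:
        raise ValueError("no candidate paths")
    return max(paths, key=lambda p: score_path(p, scene))


def execute(path: Path, scene: Scene) -> Optional[str]:
    """Run the modules of a path in order over the toy scene."""
    selected: Scene = scene
    answer: Optional[str] = None
    for module, arg in path:
        if module == "find":
            selected = [obj for obj in scene if obj["name"] == arg]
        elif module == "relate":
            # Keep objects whose relation `arg` points at a selected object.
            anchors = {obj["name"] for obj in selected}
            selected = [obj for obj in scene if obj.get(arg) in anchors]
        elif module == "query":
            answer = selected[0].get(arg) if selected else None
        else:
            raise ValueError(f"unknown module: {module}")
    return answer
```

With a scene that contains only a cup, the second candidate scores zero because its `find` target is absent, so the first path is selected and executed. A real system would replace both the proposal and the scoring functions with learned components.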

Original language: English
Title of host publication: Pattern Recognition and Computer Vision - 7th Chinese Conference, PRCV 2024, Proceedings
Editors: Zhouchen Lin, Hongbin Zha, Ming-Ming Cheng, Ran He, Cheng-Lin Liu, Kurban Ubul, Wushouer Silamu, Jie Zhou
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 167-180
Number of pages: 14
ISBN (Print): 9789819784868
Publication status: Published - 2025
Event: 7th Chinese Conference on Pattern Recognition and Computer Vision, PRCV 2024 - Urumqi, China
Duration: 18 Oct 2024 to 20 Oct 2024

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 15031 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 7th Chinese Conference on Pattern Recognition and Computer Vision, PRCV 2024
Country/Territory: China
City: Urumqi
Period: 18/10/24 to 20/10/24

Keywords

  • Large language model
  • Neural module network
  • Visual question answering
