Multi-Sourced Compositional Generalization in Visual Question Answering

  • Chuanhao Li
  • , Wenbo Ye
  • , Zhen Li
  • , Yuwei Wu*
  • , Yunde Jia
  • *Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceedingConference contributionpeer-review

Abstract

Compositional generalization is the ability of generalizing novel compositions from seen primitives, and has received much attention in vision-and-language (V&L) recently. Due to the multi-modal nature of V&L tasks, the primitives composing compositions source from different modalities, resulting in multi-sourced novel compositions. However, the generalization ability over multi-sourced novel compositions, i.e., multi-sourced compositional generalization (MSCG) remains unexplored. In this paper, we explore MSCG in the context of visual question answering (VQA), and propose a retrieval-augmented training framework to enhance the MSCG ability of VQA models by learning unified representations for primitives from different modalities. Specifically, semantically equivalent primitives are retrieved for each primitive in the training samples, and the retrieved features are aggregated with the original primitive to refine the model. This process helps the model learn consistent representations for the same semantic primitives across different modalities. To evaluate the MSCG ability of VQA models, we construct a new GQA-MSCG dataset based on the GQA dataset, in which samples include three types of novel compositions composed of primitives from different modalities. The GQA-MSCG dataset is available at https://github.com/NeverMoreLCH/MSCG.

Original languageEnglish
Title of host publicationProceedings of the 34th International Joint Conference on Artificial Intelligence, IJCAI 2025
EditorsJames Kwok
PublisherInternational Joint Conferences on Artificial Intelligence
Pages1341-1349
Number of pages9
ISBN (Electronic)9781956792065
DOIs
Publication statusPublished - 2025
Externally publishedYes
Event34th Internationa Joint Conference on Artificial Intelligence, IJCAI 2025 - Montreal, Canada
Duration: 16 Aug 202522 Aug 2025

Publication series

NameIJCAI International Joint Conference on Artificial Intelligence
ISSN (Print)1045-0823

Conference

Conference34th Internationa Joint Conference on Artificial Intelligence, IJCAI 2025
Country/TerritoryCanada
CityMontreal
Period16/08/2522/08/25

Fingerprint

Dive into the research topics of 'Multi-Sourced Compositional Generalization in Visual Question Answering'. Together they form a unique fingerprint.

Cite this