TY - GEN
T1 - Multi-Sourced Compositional Generalization in Visual Question Answering
AU - Li, Chuanhao
AU - Ye, Wenbo
AU - Li, Zhen
AU - Wu, Yuwei
AU - Jia, Yunde
N1 - Publisher Copyright:
© 2025 International Joint Conferences on Artificial Intelligence. All rights reserved.
PY - 2025
Y1 - 2025
AB - Compositional generalization is the ability to generalize to novel compositions of seen primitives, and it has recently received much attention in vision-and-language (V&L). Due to the multi-modal nature of V&L tasks, the primitives that compose a composition come from different modalities, resulting in multi-sourced novel compositions. However, the ability to generalize over multi-sourced novel compositions, i.e., multi-sourced compositional generalization (MSCG), remains unexplored. In this paper, we explore MSCG in the context of visual question answering (VQA) and propose a retrieval-augmented training framework that enhances the MSCG ability of VQA models by learning unified representations for primitives from different modalities. Specifically, semantically equivalent primitives are retrieved for each primitive in the training samples, and the retrieved features are aggregated with the original primitive to refine the model. This process helps the model learn consistent representations for the same semantic primitives across different modalities. To evaluate the MSCG ability of VQA models, we construct a new GQA-MSCG dataset based on the GQA dataset, in which samples include three types of novel compositions composed of primitives from different modalities. The GQA-MSCG dataset is available at https://github.com/NeverMoreLCH/MSCG.
UR - https://www.scopus.com/pages/publications/105021834745
U2 - 10.24963/ijcai.2025/150
DO - 10.24963/ijcai.2025/150
M3 - Conference contribution
AN - SCOPUS:105021834745
T3 - IJCAI International Joint Conference on Artificial Intelligence
SP - 1341
EP - 1349
BT - Proceedings of the 34th International Joint Conference on Artificial Intelligence, IJCAI 2025
A2 - Kwok, James
PB - International Joint Conferences on Artificial Intelligence
T2 - 34th International Joint Conference on Artificial Intelligence, IJCAI 2025
Y2 - 16 August 2025 through 22 August 2025
ER -