In-Context Compositional Generalization for Large Vision-Language Models

Chuanhao Li, Chenchen Jing, Zhen Li, Mingliang Zhai, Yuwei Wu*, Yunde Jia

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceedingConference contributionpeer-review

2 Citations (Scopus)

Abstract

Recent work has revealed that in-context learning for large language models exhibits compositional generalization capacity, which can be enhanced by selecting in-context demonstrations similar to test cases to provide contextual information. However, how to exhibit in-context compositional generalization (ICCG) of large vision-language models (LVLMs) is non-trival. Due to the inherent asymmetry between visual and linguistic modalities, ICCG in LVLMs faces an inevitable challenge-redundant information on the visual modality. The redundant information affects in-context learning from two aspects: (1) Similarity calculation may be dominated by redundant information, resulting in sub-optimal demonstration selection. (2) Redundant information in in-context demonstrations brings misleading contextual information to in-context learning. To alleviate these problems, we propose a demonstration selection method to achieve ICCG for LVLMs, by considering two key factors of demonstrations: content and structure, from a multimodal perspective. Specifically, we design a diversity-coverage-based matching score to select demonstrations with maximum coverage, and avoid selecting demonstrations with redundant information via their content redundancy and structural complexity. We build a GQA-ICCG dataset to simulate the ICCG setting, and conduct experiments on GQA-ICCG and the VQA v2 dataset. Experimental results demonstrate the effectiveness of our method.

Original languageEnglish
Title of host publicationEMNLP 2024 - 2024 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference
EditorsYaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
PublisherAssociation for Computational Linguistics (ACL)
Pages17954-17966
Number of pages13
ISBN (Electronic)9798891761643
DOIs
Publication statusPublished - 2024
Event2024 Conference on Empirical Methods in Natural Language Processing, EMNLP 2024 - Hybrid, Miami, United States
Duration: 12 Nov 202416 Nov 2024

Publication series

NameEMNLP 2024 - 2024 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference

Conference

Conference2024 Conference on Empirical Methods in Natural Language Processing, EMNLP 2024
Country/TerritoryUnited States
CityHybrid, Miami
Period12/11/2416/11/24

Fingerprint

Dive into the research topics of 'In-Context Compositional Generalization for Large Vision-Language Models'. Together they form a unique fingerprint.

Cite this