Exploring the Effect of Primitives for Compositional Generalization in Vision-and-Language

Chuanhao Li, Zhen Li, Chenchen Jing*, Yunde Jia, Yuwei Wu*

*Corresponding author for this work

Research output: Contribution to journal > Conference article > peer-review

8 Citations (Scopus)

Abstract

Compositionality is one of the fundamental properties of human cognition (Fodor & Pylyshyn, 1988). Compositional generalization is critical to simulating the compositional capability of humans, and has received much attention in the vision-and-language (V&L) community. To improve compositional generalization, it is essential to understand the effect of the primitives, including words, image regions, and video frames. In this paper, we explore the effect of primitives for compositional generalization in V&L. Specifically, we present a self-supervised learning based framework that equips existing V&L methods with two characteristics: semantic equivariance and semantic invariance. With these two characteristics, the methods understand primitives by perceiving the effect of primitive changes on sample semantics and ground truth. Experimental results on two tasks, temporal video grounding and visual question answering, demonstrate the effectiveness of our framework.

Original language: English
Pages (from-to): 19092-19101
Number of pages: 10
Journal: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
Volume: 2023-June
Publication status: Published - 2023
Event: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023 - Vancouver, Canada
Duration: 18 Jun 2023 to 22 Jun 2023

Keywords

  • Language
  • Reasoning
  • Vision
