TY - JOUR
T1 - Exploring the Effect of Primitives for Compositional Generalization in Vision-and-Language
AU - Li, Chuanhao
AU - Li, Zhen
AU - Jing, Chenchen
AU - Jia, Yunde
AU - Wu, Yuwei
N1 - Publisher Copyright: ©2023 IEEE.
PY - 2023
Y1 - 2023
N2 - Compositionality is one of the fundamental properties of human cognition (Fodor & Pylyshyn, 1988). Compositional generalization is critical to simulating the compositional capability of humans, and has received much attention in the vision-and-language (V&L) community. To improve compositional generalization, it is essential to understand the effect of primitives, including words, image regions, and video frames. In this paper, we explore the effect of primitives for compositional generalization in V&L. Specifically, we present a framework based on self-supervised learning that equips existing V&L methods with two characteristics: semantic equivariance and semantic invariance. With these two characteristics, the methods understand primitives by perceiving the effect of primitive changes on sample semantics and ground truth. Experimental results on two tasks, temporal video grounding and visual question answering, demonstrate the effectiveness of our framework.
KW - language
KW - reasoning
KW - vision
UR - http://www.scopus.com/inward/record.url?scp=85217836476&partnerID=8YFLogxK
U2 - 10.1109/CVPR52729.2023.01830
DO - 10.1109/CVPR52729.2023.01830
M3 - Conference article
AN - SCOPUS:85217836476
SN - 1063-6919
VL - 2023-June
SP - 19092
EP - 19101
JO - Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
JF - Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
T2 - 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023
Y2 - 18 June 2023 through 22 June 2023
ER -