TY - GEN
T1 - Relational Distant Supervision for Image Captioning without Image-Text Pairs
AU - Qi, Yayun
AU - Zhao, Wentian
AU - Wu, Xinxiao
N1 - Publisher Copyright:
Copyright © 2024, Association for the Advancement of Artificial Intelligence.
PY - 2024/3/25
Y1 - 2024/3/25
AB - Unsupervised image captioning aims to generate descriptions of images without relying on any image-sentence pairs for training. Most existing works use detected visual objects or concepts as a bridge to connect images and texts. Considering that the relationships between objects carry more information, we use object relationships as a more accurate connection between images and texts. In this paper, we adapt the idea of distant supervision: knowledge about object relationships is extracted from an external corpus and imparted to images to facilitate inferring visual object relationships, without introducing any extra pre-trained relationship detectors. Based on these learned informative relationships, we construct pseudo image-sentence pairs for training the captioning model. Specifically, our method consists of three modules: (i) a relationship learning module that learns to infer relationships from images under distant supervision; (ii) a relationship-to-sentence module that transforms the inferred relationships into sentences to generate pseudo image-sentence pairs; (iii) an image captioning module that is trained on the generated image-sentence pairs. Promising results on three datasets show that our method outperforms state-of-the-art methods for unsupervised image captioning.
UR - http://www.scopus.com/inward/record.url?scp=85189549955&partnerID=8YFLogxK
U2 - 10.1609/aaai.v38i5.28251
DO - 10.1609/aaai.v38i5.28251
M3 - Conference contribution
AN - SCOPUS:85189549955
T3 - Proceedings of the AAAI Conference on Artificial Intelligence
SP - 4524
EP - 4532
BT - Technical Tracks 14
A2 - Wooldridge, Michael
A2 - Dy, Jennifer
A2 - Natarajan, Sriraam
PB - Association for the Advancement of Artificial Intelligence
T2 - 38th AAAI Conference on Artificial Intelligence, AAAI 2024
Y2 - 20 February 2024 through 27 February 2024
ER -