TY - GEN
T1 - Relational Distant Supervision for Image Captioning without Image-Text Pairs
AU - Qi, Yayun
AU - Zhao, Wentian
AU - Wu, Xinxiao
N1 - Publisher Copyright:
Copyright © 2024, Association for the Advancement of Artificial Intelligence.
PY - 2024/3/25
Y1 - 2024/3/25
AB - Unsupervised image captioning aims to generate descriptions of images without relying on any image-sentence pairs for training. Most existing works use detected visual objects or concepts as a bridge to connect images and texts. Considering that the relationships between objects carry more information, we use object relationships as a more accurate connection between images and texts. In this paper, we adapt the idea of distant supervision: knowledge about object relationships is extracted from an external corpus and imparted to images to facilitate inferring visual object relationships, without introducing any extra pre-trained relationship detectors. Based on these learned informative relationships, we construct pseudo image-sentence pairs for training the captioning model. Specifically, our method consists of three modules: (i) a relationship learning module that learns to infer relationships from images under distant supervision; (ii) a relationship-to-sentence module that transforms the inferred relationships into sentences to generate pseudo image-sentence pairs; (iii) an image captioning module that is trained on the generated image-sentence pairs. Promising results on three datasets show that our method outperforms state-of-the-art methods for unsupervised image captioning.
UR - http://www.scopus.com/inward/record.url?scp=85189549955&partnerID=8YFLogxK
U2 - 10.1609/aaai.v38i5.28251
DO - 10.1609/aaai.v38i5.28251
M3 - Conference contribution
AN - SCOPUS:85189549955
T3 - Proceedings of the AAAI Conference on Artificial Intelligence
SP - 4524
EP - 4532
BT - Technical Tracks 14
A2 - Wooldridge, Michael
A2 - Dy, Jennifer
A2 - Natarajan, Sriraam
PB - Association for the Advancement of Artificial Intelligence
T2 - 38th AAAI Conference on Artificial Intelligence, AAAI 2024
Y2 - 20 February 2024 through 27 February 2024
ER -