Joint commonsense and relation reasoning for image and video captioning

Jingyi Hou, Xinxiao Wu*, Xiaoxun Zhang, Yayun Qi, Yunde Jia, Jiebo Luo

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

51 Citations (Scopus)

Abstract

Exploiting relationships between objects for image and video captioning has received increasing attention. Most existing methods depend heavily on pre-trained detectors of objects and their relationships, and thus may not work well when facing detection challenges such as heavy occlusion, tiny-size objects, and long-tail classes. In this paper, we propose a joint commonsense and relation reasoning method that exploits prior knowledge for image and video captioning without relying on any detectors. The prior knowledge provides semantic correlations and constraints between objects, serving as guidance to build semantic graphs that summarize object relationships, some of which cannot be directly perceived from images or videos. Particularly, our method is implemented by an iterative learning algorithm that alternates between 1) commonsense reasoning for embedding visual regions into the semantic space to build a semantic graph and 2) relation reasoning for encoding semantic graphs to generate sentences. Experiments on several benchmark datasets validate the effectiveness of our prior knowledge-based approach.

Original language: English
Title of host publication: AAAI 2020 - 34th AAAI Conference on Artificial Intelligence
Publisher: AAAI Press
Pages: 10973-10980
Number of pages: 8
ISBN (Electronic): 9781577358350
Publication status: Published - 2020
Event: 34th AAAI Conference on Artificial Intelligence, AAAI 2020 - New York, United States
Duration: 7 Feb 2020 to 12 Feb 2020

Publication series

Name: AAAI 2020 - 34th AAAI Conference on Artificial Intelligence

Conference

Conference: 34th AAAI Conference on Artificial Intelligence, AAAI 2020
Country/Territory: United States
City: New York
Period: 7/02/20 to 12/02/20
