TY - GEN
T1 - Joint commonsense and relation reasoning for image and video captioning
AU - Hou, Jingyi
AU - Wu, Xinxiao
AU - Zhang, Xiaoxun
AU - Qi, Yayun
AU - Jia, Yunde
AU - Luo, Jiebo
N1 - Publisher Copyright:
© AAAI 2020 - 34th AAAI Conference on Artificial Intelligence. All Rights Reserved.
PY - 2020
Y1 - 2020
AB - Exploiting relationships between objects for image and video captioning has received increasing attention. Most existing methods depend heavily on pre-trained detectors of objects and their relationships, and thus may not work well when facing detection challenges such as heavy occlusion, tiny objects, and long-tail classes. In this paper, we propose a joint commonsense and relation reasoning method that exploits prior knowledge for image and video captioning without relying on any detectors. The prior knowledge provides semantic correlations and constraints between objects, serving as guidance to build semantic graphs that summarize object relationships, some of which cannot be directly perceived from images or videos. In particular, our method is implemented as an iterative learning algorithm that alternates between 1) commonsense reasoning, which embeds visual regions into the semantic space to build a semantic graph, and 2) relation reasoning, which encodes semantic graphs to generate sentences. Experiments on several benchmark datasets validate the effectiveness of our prior knowledge-based approach.
UR - http://www.scopus.com/inward/record.url?scp=85099000915&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85099000915
T3 - AAAI 2020 - 34th AAAI Conference on Artificial Intelligence
SP - 10973
EP - 10980
BT - AAAI 2020 - 34th AAAI Conference on Artificial Intelligence
PB - AAAI Press
T2 - 34th AAAI Conference on Artificial Intelligence, AAAI 2020
Y2 - 7 February 2020 through 12 February 2020
ER -