TY - GEN
T1 - Relational Attention with Textual Enhanced Transformer for Image Captioning
AU - Song, Lifei
AU - Shi, Yiwen
AU - Xiao, Xinyu
AU - Zhang, Chunxia
AU - Xiang, Shiming
N1 - Publisher Copyright:
© 2021, Springer Nature Switzerland AG.
PY - 2021
Y1 - 2021
N2 - Image captioning, which aims to generate a natural language description of an image, has attracted extensive research interest in recent years. However, many approaches focus only on individual target objects without exploring the relationships between objects and their surroundings, which significantly limits the performance of captioning models. To address this issue, we propose a relation model that incorporates relational information between objects at different levels, including low-level box proposals and high-level region features, into the captioning model. Moreover, Transformer-based architectures have shown great success in image captioning, where image regions are encoded and then attended into attention vectors that guide caption generation. However, these attention vectors contain only image-level information and neglect textual information, which limits the capability of captioning in both the visual and textual domains. In this paper, we introduce a Textual Enhanced Transformer (TET) to incorporate textual information into the Transformer. TET consists of two modules: a text-guided Transformer and a self-attention Transformer. The two modules perform semantic and visual attention, respectively, to guide the decoder to generate high-quality captions. We extensively evaluate our model on the MS COCO dataset; it achieves a 128.7 CIDEr-D score on the Karpathy split and a 126.3 CIDEr-D (c40) score on the official online evaluation server.
KW - Attention
KW - Relational information
KW - Textual enhanced transformer
UR - http://www.scopus.com/inward/record.url?scp=85118197108&partnerID=8YFLogxK
U2 - 10.1007/978-3-030-88010-1_13
DO - 10.1007/978-3-030-88010-1_13
M3 - Conference contribution
AN - SCOPUS:85118197108
SN - 9783030880095
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 151
EP - 163
BT - Pattern Recognition and Computer Vision - 4th Chinese Conference, PRCV 2021, Proceedings
A2 - Ma, Huimin
A2 - Wang, Liang
A2 - Zhang, Changshui
A2 - Wu, Fei
A2 - Tan, Tieniu
A2 - Wang, Yaonan
A2 - Lai, Jianhuang
A2 - Zhao, Yao
PB - Springer Science and Business Media Deutschland GmbH
T2 - 4th Chinese Conference on Pattern Recognition and Computer Vision, PRCV 2021
Y2 - 29 October 2021 through 1 November 2021
ER -