Relational Attention with Textual Enhanced Transformer for Image Captioning

Lifei Song*, Yiwen Shi, Xinyu Xiao, Chunxia Zhang, Shiming Xiang

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

1 Citation (Scopus)

Abstract

Image captioning, which aims to generate a natural language description of an image, has attracted extensive research interest in recent years. However, many approaches focus only on information about individual target objects, without exploring the relationships between objects and their surroundings, which greatly limits the performance of captioning models. To address this issue, we propose a relation model that incorporates relational information between objects at different levels into the captioning model, including low-level box proposals and high-level region features. Moreover, Transformer-based architectures have shown great success in image captioning, where image regions are encoded and then attended into attention vectors to guide caption generation. However, these attention vectors contain only image-level information and do not consider textual information, which fails to extend captioning capability across both the visual and textual domains. In this paper, we introduce a Textual Enhanced Transformer (TET) that adds textual information to the Transformer. TET has two modules: a text-guided Transformer and a self-attention Transformer. The two modules perform semantic and visual attention to guide the decoder to generate high-quality captions. We extensively evaluate our model on the MS COCO dataset, where it achieves a CIDEr-D score of 128.7 on the Karpathy split and a CIDEr-D (c40) score of 126.3 on the official online evaluation server.
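The two kinds of attention named in the abstract can be illustrated with a small, generic sketch. This is not the authors' implementation: it uses plain scaled dot-product attention, the toy vectors are hypothetical, and fusing the two outputs by addition is an assumption made only for illustration, since the abstract does not say how TET combines its modules.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Scaled dot-product attention: one weighted average of values per query."""
    dim = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        )
    return outputs


# Hypothetical toy inputs: 3 region features and 2 word embeddings, dimension 4.
regions = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
words = [[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]]

# Visual attention: regions attend to other regions (self-attention).
visual = attend(regions, regions, regions)
# Semantic attention: regions attend to word embeddings (text-guided).
textual = attend(regions, words, words)
# Assumed fusion by element-wise addition; the abstract does not specify this step.
fused = [[a + b for a, b in zip(v, t)] for v, t in zip(visual, textual)]
```

Here each region vector ends up carrying both visual context from the other regions and semantic context from the words it matches best, which is the general idea behind attending in both domains.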

Original language: English
Title of host publication: Pattern Recognition and Computer Vision - 4th Chinese Conference, PRCV 2021, Proceedings
Editors: Huimin Ma, Liang Wang, Changshui Zhang, Fei Wu, Tieniu Tan, Yaonan Wang, Jianhuang Lai, Yao Zhao
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 151-163
Number of pages: 13
ISBN (Print): 9783030880095
DOI
Publication status: Published - 2021
Event: 4th Chinese Conference on Pattern Recognition and Computer Vision, PRCV 2021 - Beijing, China
Duration: 29 Oct 2021 to 1 Nov 2021

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 13021 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 4th Chinese Conference on Pattern Recognition and Computer Vision, PRCV 2021
Country/Territory: China
City: Beijing
Period: 29/10/21 to 1/11/21
