Relational Attention with Textual Enhanced Transformer for Image Captioning

Lifei Song*, Yiwen Shi, Xinyu Xiao, Chunxia Zhang, Shiming Xiang

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

1 Citation (Scopus)

Abstract

Image captioning, which aims to generate a natural language description of an image, has attracted extensive research interest in recent years. However, many approaches focus only on individual target-object information without exploring the relationships between objects and their surroundings, which greatly limits the performance of captioning models. To address this issue, we propose a relation model that incorporates relational information between objects at different levels into the captioning model, including low-level box proposals and high-level region features. Moreover, Transformer-based architectures have shown great success in image captioning, where image regions are encoded and then attended into attention vectors to guide caption generation. However, these attention vectors contain only image-level information and ignore textual information, which prevents the model from exploiting both the visual and textual domains. In this paper, we introduce a Textual Enhanced Transformer (TET) to incorporate textual information into the Transformer. TET consists of two modules: a text-guided Transformer and a self-attention Transformer, which perform semantic and visual attention, respectively, to guide the decoder toward high-quality captions. We extensively evaluate our model on the MS COCO dataset; it achieves a CIDEr-D score of 128.7 on the Karpathy split and a CIDEr-D (c40) score of 126.3 on the official online evaluation server.
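To illustrate the idea of combining visual self-attention with text-guided attention described in the abstract, the following is a minimal, hypothetical PyTorch sketch (not the authors' implementation; module names, dimensions, and the fusion strategy are assumptions). It shows one encoder layer that attends over region features with self-attention and, in parallel, attends from regions to caption-word embeddings, then fuses the two outputs.

```python
# Hypothetical sketch of a "textual enhanced" encoder layer: self-attention over
# image regions plus cross-attention from regions to word embeddings, fused and
# added back with a residual connection. Not the paper's actual code.
import torch
import torch.nn as nn

class TextEnhancedEncoderLayer(nn.Module):
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.text_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.fuse = nn.Linear(2 * d_model, d_model)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, regions, text):
        # regions: (B, N, d_model) image region features
        # text:    (B, T, d_model) caption-word embeddings (assumed input)
        vis, _ = self.self_attn(regions, regions, regions)  # visual attention
        sem, _ = self.text_attn(regions, text, text)        # text-guided attention
        fused = self.fuse(torch.cat([vis, sem], dim=-1))
        return self.norm(regions + fused)                   # residual + layer norm

# Usage with random tensors
layer = TextEnhancedEncoderLayer()
regions = torch.randn(2, 36, 512)  # e.g. 36 detected region features
text = torch.randn(2, 12, 512)     # e.g. 12 word embeddings
out = layer(regions, text)         # (2, 36, 512)
```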

Original language: English
Title of host publication: Pattern Recognition and Computer Vision - 4th Chinese Conference, PRCV 2021, Proceedings
Editors: Huimin Ma, Liang Wang, Changshui Zhang, Fei Wu, Tieniu Tan, Yaonan Wang, Jianhuang Lai, Yao Zhao
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 151-163
Number of pages: 13
ISBN (Print): 9783030880095
DOIs
Publication status: Published - 2021
Event: 4th Chinese Conference on Pattern Recognition and Computer Vision, PRCV 2021 - Beijing, China
Duration: 29 Oct 2021 - 1 Nov 2021

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 13021 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 4th Chinese Conference on Pattern Recognition and Computer Vision, PRCV 2021
Country/Territory: China
City: Beijing
Period: 29/10/21 - 1/11/21

Keywords

  • Attention
  • Relational information
  • Textual enhanced transformer
