Transformer with Prior Language Knowledge for Image Captioning

Daisong Yan, Wenxin Yu*, Zhiqiang Zhang, Jun Gong

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

2 Citations (Scopus)

Abstract

The Transformer architecture represents the state of the art in image captioning. However, even though the Transformer uses positional encodings to encode sentences, its grammatical performance remains unsatisfactory. To improve image captioning, we present the Prior Language Knowledge Transformer (PLKT), a transformer-based model that integrates learned a priori language knowledge for image captioning. In our proposal, when the model predicts the next word, it depends not only on the previously generated sequence but also on prior language knowledge. To obtain this prior language knowledge, we embed a learnable memory vector inside the self-attention module. Meanwhile, we use reinforcement learning to fine-tune the model during training. To demonstrate the effectiveness of PLKT, we compare our approach with other recent image captioning methods in the experiments. In objective terms, our proposal increases the CIDEr score of the baseline by 0.6 points on the "Karpathy" test split of the COCO 2014 dataset. In subjective terms, the sentences generated by our approach are noticeably more grammatical than those of the baseline.
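The abstract does not spell out how the learnable memory vector is embedded inside self-attention; the following is a minimal NumPy sketch of one common formulation of memory-augmented attention, in which learnable memory slots (here called `Mk` and `Mv`, names assumed for illustration) are concatenated to the projected keys and values so that each query can also attend to stored prior-knowledge vectors.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def memory_augmented_attention(X, Wq, Wk, Wv, Mk, Mv):
    """Self-attention whose keys/values are extended with learnable
    memory slots Mk, Mv of shape (m, d). Queries come only from the
    input X of shape (n, d), so the output keeps shape (n, d)."""
    Q = X @ Wq                                   # (n, d)
    K = np.concatenate([X @ Wk, Mk], axis=0)     # (n + m, d)
    V = np.concatenate([X @ Wv, Mv], axis=0)     # (n + m, d)
    scores = Q @ K.T / np.sqrt(Q.shape[-1])      # (n, n + m)
    return softmax(scores) @ V                   # (n, d)

# Toy usage: 4 tokens, 2 memory slots, model width 8.
rng = np.random.default_rng(0)
n, m, d = 4, 2, 8
X = rng.normal(size=(n, d))
Wq, Wk, Wv = (np.eye(d) for _ in range(3))
Mk, Mv = rng.normal(size=(m, d)), rng.normal(size=(m, d))
out = memory_augmented_attention(X, Wq, Wk, Wv, Mk, Mv)
```

In a full model the memory slots would be trained jointly with the projection matrices; this sketch only shows how the extra slots enter the attention computation.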

Original language: English
Title of host publication: Neural Information Processing - 28th International Conference, ICONIP 2021, Proceedings
Editors: Teddy Mantoro, Minho Lee, Media Anugerah Ayu, Kok Wai Wong, Achmad Nizar Hidayanto
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 40-51
Number of pages: 12
ISBN (Print): 9783030922696
Publication status: Published - 2021
Event: 28th International Conference on Neural Information Processing, ICONIP 2021 - Virtual, Online
Duration: 8 Dec 2021 - 12 Dec 2021

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 13109 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 28th International Conference on Neural Information Processing, ICONIP 2021
City: Virtual, Online
Period: 8/12/21 - 12/12/21

Keywords

  • Image caption
  • Prior language knowledge
  • Self-attention
  • Transformer
