Transformer with Prior Language Knowledge for Image Captioning

Daisong Yan, Wenxin Yu*, Zhiqiang Zhang, Jun Gong

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

2 Citations (Scopus)

Abstract

The Transformer architecture represents the state of the art in image captioning. However, even though the Transformer uses positional encodings to encode sentences, its grammatical performance remains unsatisfactory. To improve image captioning, we present the Prior Language Knowledge Transformer (PLKT), a transformer-based model that integrates learned a priori language knowledge for image captioning. In our proposal, when the model predicts the next word, it depends not only on the previously generated sequence but also on prior language knowledge. To obtain this prior language knowledge, we embed a learnable memory vector inside the self-attention module. Meanwhile, we use reinforcement learning to fine-tune the model during training. To demonstrate the effectiveness of PLKT, we compare our approach with other recent image captioning methods in the experiments. In objective terms, our proposal increases the CIDEr score of the baseline by 0.6 points on the "Karpathy" test split of the COCO 2014 dataset. In subjective terms, the sentences generated by our approach are noticeably more grammatical than those of the baseline.
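The abstract does not spell out how the learnable memory vector is embedded inside self-attention; the following is a minimal NumPy sketch of one common formulation of memory-augmented attention, in which learnable memory slots (here called `Mk` and `Mv`, names assumed for illustration) are concatenated to the projected keys and values so that each query can also attend to stored prior-knowledge vectors.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def memory_augmented_attention(X, Wq, Wk, Wv, Mk, Mv):
    """Self-attention whose keys/values are extended with learnable
    memory slots Mk, Mv of shape (m, d). Queries come only from the
    input X of shape (n, d), so the output keeps shape (n, d)."""
    Q = X @ Wq                                   # (n, d)
    K = np.concatenate([X @ Wk, Mk], axis=0)     # (n + m, d)
    V = np.concatenate([X @ Wv, Mv], axis=0)     # (n + m, d)
    scores = Q @ K.T / np.sqrt(Q.shape[-1])      # (n, n + m)
    return softmax(scores) @ V                   # (n, d)

# Toy usage: 4 tokens, 2 memory slots, model width 8.
rng = np.random.default_rng(0)
n, m, d = 4, 2, 8
X = rng.normal(size=(n, d))
Wq, Wk, Wv = (np.eye(d) for _ in range(3))
Mk, Mv = rng.normal(size=(m, d)), rng.normal(size=(m, d))
out = memory_augmented_attention(X, Wq, Wk, Wv, Mk, Mv)
```

In a full model the memory slots would be trained jointly with the projection matrices; this sketch only shows how the extra slots enter the attention computation.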

Original language: English
Title of host publication: Neural Information Processing - 28th International Conference, ICONIP 2021, Proceedings
Editors: Teddy Mantoro, Minho Lee, Media Anugerah Ayu, Kok Wai Wong, Achmad Nizar Hidayanto
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 40-51
Number of pages: 12
ISBN (Print): 9783030922696
Publication status: Published - 2021
Event: 28th International Conference on Neural Information Processing, ICONIP 2021 - Virtual, Online
Duration: 8 Dec 2021 - 12 Dec 2021

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 13109 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 28th International Conference on Neural Information Processing, ICONIP 2021
City: Virtual, Online
Period: 8/12/21 - 12/12/21

Keywords

  • Image caption
  • Prior language knowledge
  • Self-attention
  • Transformer
