TY - GEN
T1 - Transformer with Prior Language Knowledge for Image Captioning
AU - Yan, Daisong
AU - Yu, Wenxin
AU - Zhang, Zhiqiang
AU - Gong, Jun
N1 - Publisher Copyright:
© 2021, Springer Nature Switzerland AG.
PY - 2021
Y1 - 2021
N2 - The Transformer architecture represents the state of the art in image captioning tasks. However, even though the Transformer uses positional encodings to encode sentences, the grammar of the captions it generates is still not good enough. To improve the performance of image captioning, we present the Prior Language Knowledge Transformer (PLKT), a transformer-based model that integrates learned prior language knowledge for image captioning. In our proposal, when the model predicts the next word, it relies not only on the previously generated sequence but also on prior language knowledge. To obtain this prior language knowledge, we embed a learnable memory vector inside the self-attention. In addition, we use reinforcement learning to fine-tune the model during training. To demonstrate the effectiveness of PLKT, we compare our approach with other recent image captioning methods in experiments. In objective evaluations, our proposal increases the CIDEr score of the baseline by 0.6 points on the “Karpathy” test split of the COCO 2014 dataset. In subjective evaluations, the sentences generated by our approach are clearly better in grammar than those of the baseline.
AB - The Transformer architecture represents the state of the art in image captioning tasks. However, even though the Transformer uses positional encodings to encode sentences, the grammar of the captions it generates is still not good enough. To improve the performance of image captioning, we present the Prior Language Knowledge Transformer (PLKT), a transformer-based model that integrates learned prior language knowledge for image captioning. In our proposal, when the model predicts the next word, it relies not only on the previously generated sequence but also on prior language knowledge. To obtain this prior language knowledge, we embed a learnable memory vector inside the self-attention. In addition, we use reinforcement learning to fine-tune the model during training. To demonstrate the effectiveness of PLKT, we compare our approach with other recent image captioning methods in experiments. In objective evaluations, our proposal increases the CIDEr score of the baseline by 0.6 points on the “Karpathy” test split of the COCO 2014 dataset. In subjective evaluations, the sentences generated by our approach are clearly better in grammar than those of the baseline.
KW - Image caption
KW - Prior language knowledge
KW - Self-attention
KW - Transformer
UR - http://www.scopus.com/inward/record.url?scp=85121931472&partnerID=8YFLogxK
U2 - 10.1007/978-3-030-92270-2_4
DO - 10.1007/978-3-030-92270-2_4
M3 - Conference contribution
AN - SCOPUS:85121931472
SN - 9783030922696
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 40
EP - 51
BT - Neural Information Processing - 28th International Conference, ICONIP 2021, Proceedings
A2 - Mantoro, Teddy
A2 - Lee, Minho
A2 - Ayu, Media Anugerah
A2 - Wong, Kok Wai
A2 - Hidayanto, Achmad Nizar
PB - Springer Science and Business Media Deutschland GmbH
T2 - 28th International Conference on Neural Information Processing, ICONIP 2021
Y2 - 8 December 2021 through 12 December 2021
ER -
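
Editor's note: for readers who want a concrete picture of the mechanism the abstract describes, below is a minimal, hypothetical sketch of memory-augmented self-attention in Python (PyTorch). It assumes a design in which learnable memory slots are concatenated to the keys and values so that attention can draw on stored prior language knowledge; the class name, dimensions, and all implementation details are illustrative assumptions, not taken from the paper.

import torch
import torch.nn as nn

class MemoryAugmentedSelfAttention(nn.Module):
    """Hypothetical sketch: self-attention whose keys/values are extended
    with learnable memory slots (an assumed reading of the paper's
    'learnable memory vector inside the self-attention')."""

    def __init__(self, d_model: int = 512, n_heads: int = 8, n_memory: int = 40):
        super().__init__()
        # Learnable memory slots, independent of the input sequence, that
        # can store prior language knowledge acquired during training.
        self.mem_k = nn.Parameter(torch.randn(1, n_memory, d_model) * 0.02)
        self.mem_v = nn.Parameter(torch.randn(1, n_memory, d_model) * 0.02)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        b = x.size(0)
        # Append the memory slots to the keys and values so that each query
        # attends over the token sequence plus the learned memory.
        k = torch.cat([x, self.mem_k.expand(b, -1, -1)], dim=1)
        v = torch.cat([x, self.mem_v.expand(b, -1, -1)], dim=1)
        out, _ = self.attn(x, k, v, need_weights=False)
        return out

In an autoregressive decoder one would additionally apply a causal mask over the token positions while keeping the memory slots visible to every query; that masking is omitted here for brevity.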