TY - JOUR
T1 - Stack-VS: Stacked Visual-Semantic Attention for Image Caption Generation
AU - Cheng, Ling
AU - Wei, Wei
AU - Mao, Xianling
AU - Liu, Yong
AU - Miao, Chunyan
N1 - Publisher Copyright:
© 2013 IEEE.
PY - 2020
Y1 - 2020
N2 - Recently, automatic image caption generation has become an important focus of work on multimodal translation tasks. Existing approaches can be roughly categorized into two classes, top-down and bottom-up: the former transfers image information (referred to as visual-level features) directly into a caption, while the latter uses extracted words (referred to as semantic-level attributes) to generate a description. However, previous methods typically rely on a one-stage decoder or utilize only part of the visual-level or semantic-level information for image caption generation. In this paper, we address this problem and propose an innovative multi-stage architecture (called Stack-VS) for rich, fine-grained image caption generation, which combines bottom-up and top-down attention models to effectively handle both the visual-level and semantic-level information of an input image. Specifically, we propose a novel stacked decoder model composed of a sequence of decoder cells, each of which contains two LSTM layers that work interactively to re-optimize the attention weights over both visual-level feature vectors and semantic-level attribute embeddings for generating a fine-grained image caption. Extensive experiments on the popular MSCOCO benchmark dataset show significant improvements on different evaluation metrics; i.e., the improvements in BLEU-4, CIDEr, and SPICE scores are 0.372, 1.226, and 0.216, respectively, compared with the state-of-the-art.
AB - Recently, automatic image caption generation has become an important focus of work on multimodal translation tasks. Existing approaches can be roughly categorized into two classes, top-down and bottom-up: the former transfers image information (referred to as visual-level features) directly into a caption, while the latter uses extracted words (referred to as semantic-level attributes) to generate a description. However, previous methods typically rely on a one-stage decoder or utilize only part of the visual-level or semantic-level information for image caption generation. In this paper, we address this problem and propose an innovative multi-stage architecture (called Stack-VS) for rich, fine-grained image caption generation, which combines bottom-up and top-down attention models to effectively handle both the visual-level and semantic-level information of an input image. Specifically, we propose a novel stacked decoder model composed of a sequence of decoder cells, each of which contains two LSTM layers that work interactively to re-optimize the attention weights over both visual-level feature vectors and semantic-level attribute embeddings for generating a fine-grained image caption. Extensive experiments on the popular MSCOCO benchmark dataset show significant improvements on different evaluation metrics; i.e., the improvements in BLEU-4, CIDEr, and SPICE scores are 0.372, 1.226, and 0.216, respectively, compared with the state-of-the-art.
KW - Attention-based mechanism
KW - image captioning
KW - multimodal
KW - recurrent neural network
UR - http://www.scopus.com/inward/record.url?scp=85090945306&partnerID=8YFLogxK
U2 - 10.1109/ACCESS.2020.3018752
DO - 10.1109/ACCESS.2020.3018752
M3 - Article
AN - SCOPUS:85090945306
SN - 2169-3536
VL - 8
SP - 154953
EP - 154965
JO - IEEE Access
JF - IEEE Access
M1 - 9174742
ER -