Stack-VS: Stacked Visual-Semantic Attention for Image Caption Generation

Ling Cheng, Wei Wei*, Xianling Mao, Yong Liu, Chunyan Miao

*Corresponding author for this work

Research output: Contribution to journal, Article, peer-reviewed

17 Citations (Scopus)
Abstract

Automatic image caption generation has recently become an important focus of work on the multimodal translation task. Existing approaches can be roughly categorized into two classes, top-down and bottom-up: the former transfers the image information (called the visual-level feature) directly into a caption, and the latter uses extracted words (called semantic-level attributes) to generate a description. However, previous methods are typically either based on a one-stage decoder or utilize only part of the visual-level or semantic-level information for image caption generation. In this paper, we address this problem and propose an innovative multi-stage architecture (called Stack-VS) for rich, fine-grained image caption generation, which combines bottom-up and top-down attention models to effectively handle both the visual-level and semantic-level information of an input image. Specifically, we also propose a novel, well-designed stack decoder model constituted by a sequence of decoder cells, each of which contains two LSTM layers that work interactively to re-optimize attention weights on both visual-level feature vectors and semantic-level attribute embeddings for generating a fine-grained image caption. Extensive experiments on the popular benchmark dataset MSCOCO show significant improvements on different evaluation metrics, i.e., the improvements on BLEU-4 / CIDEr / SPICE scores are 0.372, 1.226 and 0.216, respectively, as compared to the state of the art.
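
The idea of re-optimizing attention weights over two feature sets across several stages can be illustrated with a minimal sketch. This is not the Stack-VS model: the function names (`attend`, `stacked_attention`), the dot-product scoring, and the averaging step that stands in for the two interacting LSTM layers are all assumptions made for illustration, and the toy vectors are invented.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, items):
    # Dot-product soft attention: returns (weights, weighted sum of items).
    scores = [sum(q * x for q, x in zip(query, item)) for item in items]
    weights = softmax(scores)
    dim = len(items[0])
    context = [sum(w * item[d] for w, item in zip(weights, items))
               for d in range(dim)]
    return weights, context

def stacked_attention(query, visual_feats, semantic_embs, num_stages=3):
    # Hypothetical stand-in for a stack of decoder cells: each stage
    # recomputes attention over both feature sets using the refined query.
    history = []
    for _ in range(num_stages):
        v_weights, v_ctx = attend(query, visual_feats)
        s_weights, s_ctx = attend(query, semantic_embs)
        # Assumption: a plain average replaces the LSTM state update.
        query = [(q + v + s) / 3.0 for q, v, s in zip(query, v_ctx, s_ctx)]
        history.append((v_weights, s_weights))
    return query, history

# Toy inputs (invented): two visual feature vectors, three attribute embeddings.
visual = [[1.0, 0.0], [0.0, 1.0]]
semantic = [[0.5, 0.5], [1.0, 0.0], [0.0, 1.0]]
final_query, history = stacked_attention([1.0, 0.0], visual, semantic, num_stages=2)
```

Each entry of `history` holds the visual and semantic attention weights for one stage, so comparing entries shows how the weights shift as the query is refined.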

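Original language: English
Article number: 9174742
Pages (from-to): 154953-154965
Number of pages: 13
Journal: IEEE Access
Volume: 8
DOI: https://doi.org/10.1109/ACCESS.2020.3018752
Publication status: Published - 2020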

Cite this

Cheng, L., Wei, W., Mao, X., Liu, Y., & Miao, C. (2020). Stack-VS: Stacked Visual-Semantic Attention for Image Caption Generation. IEEE Access, 8, 154953-154965. Article 9174742. https://doi.org/10.1109/ACCESS.2020.3018752