Stack-VS: Stacked Visual-Semantic Attention for Image Caption Generation

Ling Cheng, Wei Wei*, Xianling Mao, Yong Liu, Chunyan Miao

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

16 Citations (Scopus)

Abstract

Recently, automatic image caption generation has become an important focus of work on the multimodal translation task. Existing approaches can be roughly divided into two classes, top-down and bottom-up: the former transfers the image information (referred to as visual-level features) directly into a caption, whereas the latter uses extracted words (referred to as semantic-level attributes) to generate a description. However, previous methods are typically based on a one-stage decoder or exploit only part of the visual-level or semantic-level information for caption generation. In this paper, we address this problem and propose a multi-stage architecture (called Stack-VS) for rich, fine-grained image caption generation, which combines bottom-up and top-down attention models to effectively handle both the visual-level and semantic-level information of an input image. Specifically, we propose a stacked decoder model consisting of a sequence of decoder cells, each of which contains two LSTM layers that work interactively to re-optimize the attention weights on both visual-level feature vectors and semantic-level attribute embeddings for generating a fine-grained image caption. Extensive experiments on the popular benchmark dataset MSCOCO show significant improvements on different evaluation metrics; i.e., the gains in BLEU-4 / CIDEr / SPICE scores over the state-of-the-art are 0.372, 1.226 and 0.216, respectively.
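To make the stacked decoder concrete, the following is a minimal PyTorch sketch of a single decoder cell, based only on the description in the abstract: two LSTM layers working interactively, one attending over visual-level feature vectors and the other over semantic-level attribute embeddings. All class names, layer sizes, and the exact wiring (a visual-attention LSTM feeding a semantic-attention LSTM) are assumptions for illustration, not the authors' released Stack-VS implementation.

```python
# Hypothetical sketch of one Stack-VS decoder cell (not the authors' code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class SoftAttention(nn.Module):
    """Additive (Bahdanau-style) attention over a set of feature vectors."""

    def __init__(self, feat_dim: int, hidden_dim: int, attn_dim: int):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, attn_dim)
        self.hidden_proj = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, feats, hidden):
        # feats: (batch, num_items, feat_dim), hidden: (batch, hidden_dim)
        energy = torch.tanh(self.feat_proj(feats) + self.hidden_proj(hidden).unsqueeze(1))
        weights = F.softmax(self.score(energy).squeeze(-1), dim=1)   # (batch, num_items)
        context = (weights.unsqueeze(-1) * feats).sum(dim=1)         # (batch, feat_dim)
        return context, weights


class DecoderCell(nn.Module):
    """One decoder cell: a visual-attention LSTM feeding a semantic-attention
    LSTM; stacking several such cells lets later cells re-weight attention."""

    def __init__(self, word_dim, vis_dim, sem_dim, hidden_dim, attn_dim):
        super().__init__()
        self.vis_attn = SoftAttention(vis_dim, hidden_dim, attn_dim)
        self.sem_attn = SoftAttention(sem_dim, hidden_dim, attn_dim)
        self.vis_lstm = nn.LSTMCell(word_dim + vis_dim, hidden_dim)
        self.sem_lstm = nn.LSTMCell(hidden_dim + sem_dim, hidden_dim)

    def forward(self, word_emb, vis_feats, sem_embs, state):
        (h1, c1), (h2, c2) = state
        # Visual-level attention conditioned on the previous semantic hidden state.
        vis_ctx, _ = self.vis_attn(vis_feats, h2)
        h1, c1 = self.vis_lstm(torch.cat([word_emb, vis_ctx], dim=1), (h1, c1))
        # Semantic-level attention conditioned on the fresh visual hidden state.
        sem_ctx, _ = self.sem_attn(sem_embs, h1)
        h2, c2 = self.sem_lstm(torch.cat([h1, sem_ctx], dim=1), (h2, c2))
        return h2, ((h1, c1), (h2, c2))


if __name__ == "__main__":
    # Toy usage with assumed dimensions: 36 image regions, 10 attribute words.
    cell = DecoderCell(word_dim=256, vis_dim=512, sem_dim=300, hidden_dim=512, attn_dim=256)
    batch = 2
    state = ((torch.zeros(batch, 512), torch.zeros(batch, 512)),
             (torch.zeros(batch, 512), torch.zeros(batch, 512)))
    out, state = cell(torch.randn(batch, 256),
                      torch.randn(batch, 36, 512),
                      torch.randn(batch, 10, 300),
                      state)
    print(out.shape)  # torch.Size([2, 512])
```

In a multi-stage configuration, the output of each cell at a time step would drive the next stacked cell, so that attention weights on the visual and semantic inputs are re-optimized stage by stage, as the abstract describes.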

Original language: English
Article number: 9174742
Pages (from-to): 154953-154965
Number of pages: 13
Journal: IEEE Access
Volume: 8
DOI
Publication status: Published - 2020
