Hierarchical encoder-decoder for image captioning

  • Lizhi Pan
  • Chengtian Song*
  • Xiaozheng Gan
  • Keyu Xu
  • Mengqian Deng

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

1 Citation (Scopus)

Abstract

Image captioning aims to encode the visual information in images and integrate it into text decoding to generate accurate natural language descriptions. Existing methods often neglect the hierarchical structure of visual information, fail to model global relationships between visual elements, and have not yet explored the synergistic effect of multi-level semantics, resulting in inaccurate captions. To address these issues, we propose a Hierarchical Encoder-decoder for Image Captioning (HierCap) that guides text generation with hierarchical visual information at three levels: global (encompassing positional relationships), regional (highlighting principal objects), and grid (capturing local details). Specifically, the hierarchical encoder employs three dedicated sub-encoders to build complementary visual representations at each level. For the decoder, a hierarchical fusion module with four variants explores the cross-modal synergistic fusion of hierarchical visual features and textual features. Extensive experiments demonstrate that HierCap achieves state-of-the-art performance on four datasets: COCO, NoCaps, Flickr8k, and Flickr30k. The results validate the effectiveness of hierarchical visual encoding and cross-modal hierarchical fusion in generating accurate and semantically rich descriptions. The source code is available at https://github.com/Panlizhi/HierCap.
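To make the three-level design concrete, the sketch below illustrates one plausible shape of such a hierarchical fusion step in NumPy: textual features cross-attend to global, regional, and grid visual features separately, and the three attended contexts are merged with a weighted sum. All dimensions, feature counts, and mixing weights here are illustrative assumptions, not the paper's actual architecture or any of its four fusion variants.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # shared feature dimension (assumed for illustration)

# Hypothetical hierarchical visual features, one set per level:
global_feat = rng.standard_normal((1, d))    # global level: scene / positional context
region_feat = rng.standard_normal((5, d))    # regional level: principal objects
grid_feat = rng.standard_normal((49, d))     # grid level: local details (7x7 grid)

text_feat = rng.standard_normal((10, d))     # decoder-side textual features (10 tokens)


def cross_attend(query, keys):
    """Scaled dot-product attention: each text token attends over one visual level."""
    scores = query @ keys.T / np.sqrt(keys.shape[1])
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ keys  # one attended visual context per text token


# Attend to each level separately, then merge the three contexts with a
# weighted sum (weights fixed here; a real model would learn them).
contexts = [cross_attend(text_feat, lvl) for lvl in (global_feat, region_feat, grid_feat)]
level_weights = [0.2, 0.4, 0.4]  # hypothetical mixing weights
fused = sum(w * c for w, c in zip(level_weights, contexts))

print(fused.shape)  # one fused (d-dimensional) visual context per text token
```

The key point the sketch captures is that each level contributes a differently grained view of the image, and the decoder consumes a single fused context per token rather than attending to one flat feature map.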

Original language: English
Article number: 131833
Journal: Neurocomputing
Volume: 660
DOIs
Publication status: Published - 7 Jan 2026

Keywords

  • Global features
  • Grid features
  • Hierarchical encoder-decoder
  • Hierarchical fusion
  • Image captioning
  • Region features
