Abstract
Image captioning aims to encode the visual information in images and integrate it into text decoding to generate accurate natural language descriptions. Existing methods often neglect the hierarchical structure of visual information, fail to model global relationships between visual elements, and have not yet explored the synergistic effect of multi-level semantics, resulting in inaccurate captions. To address these issues, we propose a Hierarchical Encoder-decoder for Image Captioning (HierCap) that guides text generation with hierarchical visual information at three levels: global (encompassing positional relationships), regional (highlighting principal objects), and grid (capturing local details). Specifically, the hierarchical encoder employs three dedicated sub-encoders to build complementary visual representations at each level. For the decoder, we provide a hierarchical fusion module with four variants to explore cross-modal synergistic fusion between hierarchical visual features and textual features. Extensive experiments demonstrate that HierCap achieves state-of-the-art performance on four datasets: COCO, NoCaps, Flickr8k, and Flickr30k. The results validate the effectiveness of hierarchical visual encoding and cross-modal hierarchical fusion in generating accurate and semantically rich descriptions. The source code is available at https://github.com/Panlizhi/HierCap.
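The three-level encoding and cross-modal fusion described in the abstract can be sketched minimally as follows. This is an illustrative toy, not the paper's implementation: the feature counts (1 global, 5 region, 49 grid vectors), the hidden size, the `sub_encoder` projection, and the attention-then-average `fuse` step are all hypothetical stand-ins for the dedicated sub-encoders and one of the four fusion variants.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # hidden size (illustrative)

def sub_encoder(x, w):
    # stand-in for one dedicated per-level sub-encoder: a linear map + nonlinearity
    return np.tanh(x @ w)

# hypothetical feature shapes: 1 global vector, 5 region vectors, 49 grid vectors
feats = {
    "global": rng.normal(size=(1, d)),
    "region": rng.normal(size=(5, d)),
    "grid":   rng.normal(size=(49, d)),
}
weights = {k: rng.normal(size=(d, d)) / np.sqrt(d) for k in feats}

# hierarchical encoder: one sub-encoder per visual level
encoded = {k: sub_encoder(v, weights[k]) for k, v in feats.items()}

def fuse(text_query, encoded):
    # one possible fusion variant: attend from the text query to each
    # level's features, then average the per-level context vectors
    contexts = []
    for feat in encoded.values():
        scores = feat @ text_query / np.sqrt(d)   # (n,) attention logits
        attn = np.exp(scores - scores.max())
        attn /= attn.sum()                        # softmax over that level
        contexts.append(attn @ feat)              # (d,) context vector
    return np.mean(contexts, axis=0)

ctx = fuse(rng.normal(size=d), encoded)
print(ctx.shape)  # (8,)
```

In this sketch each level is attended to independently, so the decoder's text query can weight global layout, salient objects, and local grid detail separately before they are combined.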
| Original language | English |
|---|---|
| Article number | 131833 |
| Journal | Neurocomputing |
| Volume | 660 |
| DOIs | |
| Publication status | Published - 7 Jan 2026 |
Keywords
- Global features
- Grid features
- Hierarchical encoder-decoder
- Hierarchical fusion
- Image captioning
- Region features