Look and Review, Then Tell: Generate More Coherent Paragraphs from Images by Fusing Visual and Textual Information

Zhen Yang, Hongxia Zhao, Ping Jian*

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Image paragraph captioning aims to describe a given image with a natural-language paragraph. Unfortunately, the paragraphs generated by existing methods typically suffer from poor coherence, since visual information is inevitably lost after the pooling operation, which maps numerous visual features to a single global vector. Moreover, the pooled vectors make it harder for the language model to interact with details in the image, leading to generic or even incorrect descriptions of visual details. In this paper, we propose a simple yet effective module, the Visual Information Enhancement Module (VIEM), to prevent visual information loss during feature pooling. Meanwhile, to model inter-sentence dependency, a fusion gate mechanism, which makes the most of the non-pooled features by fusing visual vectors with textual information, is introduced into the language model to further improve paragraph coherence. In experiments, the visual information loss is quantitatively measured with a mutual information based method. Surprisingly, the results indicate that such loss in VIEM is only approximately 50% of that in pooling, effectively demonstrating the efficacy of VIEM. In addition, extensive experiments on the Stanford image-paragraph dataset show that the proposed method achieves promising performance compared with existing methods. We will release our code at https://github.com/Young-Zhen/paraCap.
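The abstract does not spell out the exact parameterization of the fusion gate, but a common form of such a mechanism computes a sigmoid gate from the concatenated visual and textual vectors and takes an elementwise convex combination of the two. The sketch below illustrates that general pattern; the weight matrix `W`, bias `b`, and the specific gating equation are illustrative assumptions, not the paper's definitive implementation.

```python
import numpy as np

def sigmoid(x):
    """Elementwise logistic function, mapping reals into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def fusion_gate(v, h, W, b):
    """Fuse a visual feature vector v with a textual hidden state h.

    A gate g = sigmoid(W [v; h] + b) decides, per dimension, how much
    visual vs. textual information to pass on; the output is an
    elementwise convex combination of the two modalities.
    (Illustrative form only; the paper's exact equation may differ.)
    """
    g = sigmoid(W @ np.concatenate([v, h]) + b)  # gate values in (0, 1)
    return g * v + (1.0 - g) * h

# Toy example with random parameters (stand-ins for learned weights).
rng = np.random.default_rng(0)
d = 4
v = rng.normal(size=d)            # non-pooled visual feature (e.g. from VIEM)
h = rng.normal(size=d)            # language-model hidden state
W = rng.normal(size=(d, 2 * d))   # hypothetical learned gate weights
b = np.zeros(d)
fused = fusion_gate(v, h, W, b)   # vector fed back into the language model
```

Because the gate lies strictly in (0, 1), each output dimension stays between the corresponding visual and textual values, so neither modality can be entirely discarded, which is one plausible way such a gate helps keep visual detail available across sentences.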

Original language: English
Title of host publication: 2024 International Joint Conference on Neural Networks, IJCNN 2024 - Proceedings
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic): 9798350359312
DOIs
Publication status: Published - 2024
Event: 2024 International Joint Conference on Neural Networks, IJCNN 2024 - Yokohama, Japan
Duration: 30 Jun 2024 – 5 Jul 2024

Publication series

Name: Proceedings of the International Joint Conference on Neural Networks

Conference

Conference: 2024 International Joint Conference on Neural Networks, IJCNN 2024
Country/Territory: Japan
City: Yokohama
Period: 30/06/24 – 5/07/24

Keywords

  • Coherence Modeling
  • Fusion Gate Mechanism
  • Image Paragraph Captioning
  • Visual Information Loss
