Look and Review, Then Tell: Generate More Coherent Paragraphs from Images by Fusing Visual and Textual Information

Zhen Yang, Hongxia Zhao, Ping Jian*

*Corresponding author of this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Image paragraph captioning aims to describe given images by generating natural paragraphs. Unfortunately, the paragraphs generated by existing methods typically suffer from poor coherence, since visual information is inevitably lost after the pooling operation, which maps numerous visual features to a single global vector. Moreover, the pooled vectors make it harder for language models to interact with details in images, leading to generic or even wrong descriptions of visual details. In this paper, we propose a simple yet effective module called the Visual Information Enhancement Module (VIEM) to prevent visual information loss during visual feature pooling. Meanwhile, to model inter-sentence dependency, a fusion gate mechanism, which makes the most of the non-pooled features by fusing visual vectors with textual information, is introduced into the language model to further improve paragraph coherence. In experiments, the visual information loss is quantitatively measured through a mutual-information-based method. Surprisingly, the results indicate that such loss in VIEM is only approximately 50% of that in pooling, effectively demonstrating the efficacy of VIEM. Moreover, extensive experiments on the Stanford image-paragraph dataset show that the proposed method achieves promising performance compared with existing methods. We will release our code at https://github.com/Young-Zhen/paraCap.
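The fusion gate described above can be illustrated with a minimal NumPy sketch of one common form of gated fusion: a learned sigmoid gate interpolates elementwise between a visual vector and a textual hidden state. The function name `fusion_gate`, the parameters `W_g` and `b_g`, and the exact gating equation are illustrative assumptions, not the paper's specification.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fusion_gate(visual, textual, W_g, b_g):
    """Gated fusion of a visual vector and a textual hidden state.

    Computes g = sigmoid(W_g [v; h] + b_g), then returns the
    elementwise convex combination g * v + (1 - g) * h, so the
    model can weigh visual details against textual context.
    """
    concat = np.concatenate([visual, textual])  # [v; h]
    g = sigmoid(W_g @ concat + b_g)             # gate in (0, 1)
    return g * visual + (1.0 - g) * textual

# Toy usage with random weights (no training implied).
rng = np.random.default_rng(0)
d = 8
v = rng.standard_normal(d)            # visual feature vector
h = rng.standard_normal(d)            # textual (LM) hidden state
W_g = rng.standard_normal((d, 2 * d)) * 0.1
b_g = np.zeros(d)
fused = fusion_gate(v, h, W_g, b_g)
print(fused.shape)
```

Because the gate output lies in (0, 1), each fused coordinate stays between the corresponding visual and textual values, so neither source of information is discarded outright.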

Original language: English
Title of host publication: 2024 International Joint Conference on Neural Networks, IJCNN 2024 - Proceedings
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (electronic): 9798350359312
DOI
Publication status: Published - 2024
Event: 2024 International Joint Conference on Neural Networks, IJCNN 2024 - Yokohama, Japan
Duration: 30 Jun 2024 - 5 Jul 2024

Publication series

Name: Proceedings of the International Joint Conference on Neural Networks

Conference

Conference: 2024 International Joint Conference on Neural Networks, IJCNN 2024
Country/Territory: Japan
City: Yokohama
Period: 30/06/24 - 5/07/24

