TY - GEN
T1 - Look and Review, Then Tell
T2 - 2024 International Joint Conference on Neural Networks, IJCNN 2024
AU - Yang, Zhen
AU - Zhao, Hongxia
AU - Jian, Ping
N1 - Publisher Copyright:
© 2024 IEEE.
PY - 2024
Y1 - 2024
N2 - Image paragraph captioning aims to describe given images by generating natural paragraphs. Unfortunately, the paragraphs generated by existing methods typically suffer from poor coherence, since visual information is inevitably lost in the pooling operation, which maps numerous visual features to a single global vector. Moreover, the pooled vectors make it harder for language models to interact with details in images, leading to generic or even incorrect descriptions of visual details. In this paper, we propose a simple yet effective module, the Visual Information Enhancement Module (VIEM), to prevent visual information loss during visual feature pooling. Meanwhile, to model inter-sentence dependency, a fusion gate mechanism, which makes the most of the non-pooled features by fusing visual vectors with textual information, is introduced into the language model to further improve paragraph coherence. In experiments, the visual information loss is quantitatively measured through a mutual information-based method. Surprisingly, the results indicate that such loss in VIEM is only approximately 50% of that in pooling, effectively demonstrating the efficacy of VIEM. Moreover, extensive experiments on the Stanford image-paragraph dataset show that the proposed method achieves promising performance compared with existing methods. We will release our code at https://github.com/Young-Zhen/paraCap.
KW - Coherence Modeling
KW - Fusion Gate Mechanism
KW - Image Paragraph Captioning
KW - Visual Information Loss
UR - http://www.scopus.com/inward/record.url?scp=85204977133&partnerID=8YFLogxK
U2 - 10.1109/IJCNN60899.2024.10650474
DO - 10.1109/IJCNN60899.2024.10650474
M3 - Conference contribution
AN - SCOPUS:85204977133
T3 - Proceedings of the International Joint Conference on Neural Networks
BT - 2024 International Joint Conference on Neural Networks, IJCNN 2024 - Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 30 June 2024 through 5 July 2024
ER -