Abstract
Images and text serve as fundamental carriers for conveying emotions in daily human communication. Sentimental image captioning requires models not only to describe the visual content accurately but also to express the underlying visual sentiment appropriately. Compared with the conventional image captioning task, which focuses purely on factual semantics, sentimental image captioning emphasizes the affective alignment between visual elements and linguistic expressions, making it particularly valuable for applications such as social media recommendation and human-computer interaction. Existing sentimental image captioning methods typically rely on large-scale pairs of images and sentimental captions. However, annotating such pairs is expensive, labor-intensive, and error-prone. Moreover, existing sentiment-related datasets mainly contain either single-modality data with sentiment class labels or images paired with text crawled from social media posts, which differs substantially from image descriptions; neither can be used as training data for sentimental image captioning. To address this limitation, we propose a novel task called unsupervised sentimental image captioning, which aims to generate image descriptions that express the images' inherent sentiments without requiring any paired image-sentence data for training. The main challenge is enabling the model to express the underlying sentiment of the image by incorporating appropriate sentimental elements without any supervision. To tackle this challenging task, we propose a method that integrates commonsense knowledge of sentimental relationships into the caption generation process. This is inspired by the fact that human sentimental expressions usually follow certain rules and have specific descriptive patterns for different combinations of entities and emotions.
Our method consists of four key components: a commonsense knowledge base of sentimental relationships, a factual sentence decoder, a sentimental sentence decoder, and a visual information extraction module. Specifically, the commonsense knowledge base of sentimental relationships is constructed from an external corpus, where a sentimental relationship represents the correlation between an entity and a sentimental description under a specific sentiment. Our method adopts a two-phase generation strategy: it first generates a factual sentence with masked sentimental parts, and then fills the masked parts with highly image-relevant sentimental words inferred from the commonsense sentimental relationships. To effectively train the model using unpaired images and a sentimental corpus, we design a novel sentimental reward for reinforcement learning that aligns generated sentimental captions with commonsense knowledge. This reward is calculated by evaluating how reasonable the generated sentimental words are according to the commonsense knowledge of sentimental relationships, which encourages the model to pay more attention to the sentimental part of a sentence. Moreover, because existing metrics evaluate content relevance and sentiment consistency independently, we propose a new metric called SentiCLIPScore. This metric jointly assesses both the factual and sentimental aspects of captions: content relevance is measured by the pre-trained multimodal model CLIP, and sentiment consistency is measured comprehensively using the sentence sentiment class and the constructed sentimental relationship knowledge base. Experiments on the COCO and Flickr30K image datasets demonstrate the efficacy of our method. Compared with unsupervised baselines, our method improves SentiCLIPScore by 4% and 13% on COCO and Flickr30K, respectively.
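The two-phase strategy and the knowledge-based reward described above can be illustrated with a toy sketch. This is not the authors' implementation: the knowledge base contents and scores, the `[MASK]` token, the rule that a mask modifies the token right after it, and the function names `fill_masks` and `sentimental_reward` are all assumptions made for illustration. The actual method uses learned decoders and visual features, which are omitted here.

```python
from typing import Dict, List, Tuple

# Hypothetical toy knowledge base (assumption, not from the paper):
# (entity, sentiment) -> {sentimental word: plausibility score}
KnowledgeBase = Dict[Tuple[str, str], Dict[str, float]]

KB: KnowledgeBase = {
    ("dog", "positive"): {"cute": 0.9, "happy": 0.7},
    ("dog", "negative"): {"dirty": 0.8, "lonely": 0.6},
    ("beach", "positive"): {"beautiful": 0.9},
}

MASK = "[MASK]"  # assumed placeholder for a masked sentimental part


def fill_masks(tokens: List[str], sentiment: str, kb: KnowledgeBase) -> List[str]:
    """Phase 2 (toy version): fill each mask with the best-scoring word
    for the entity that directly follows it; drop masks with no candidates."""
    filled: List[str] = []
    for i, tok in enumerate(tokens):
        if tok != MASK:
            filled.append(tok)
            continue
        entity = tokens[i + 1] if i + 1 < len(tokens) else ""
        candidates = kb.get((entity, sentiment), {})
        if candidates:
            filled.append(max(candidates, key=candidates.get))
    return filled


def sentimental_reward(
    pairs: List[Tuple[str, str]], sentiment: str, kb: KnowledgeBase
) -> float:
    """Toy reward: mean knowledge-base score of (word, entity) pairs.
    Pairs that are absent from the knowledge base score 0.0."""
    if not pairs:
        return 0.0
    scores = [kb.get((entity, sentiment), {}).get(word, 0.0) for word, entity in pairs]
    return sum(scores) / len(scores)


# Phase 1 output is assumed to be a factual sentence with masked sentimental parts.
factual = "a [MASK] dog on the [MASK] beach".split()
print(" ".join(fill_masks(factual, "positive", KB)))
print(sentimental_reward([("cute", "dog"), ("sunny", "beach")], "positive", KB))
```

In this sketch the reward only looks at the sentimental words, which mirrors the stated goal of focusing the model on the sentimental part of a sentence; how the reward is combined with other training signals is not specified in the abstract and is left out.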
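| Translated title of the contribution | Unsupervised Sentimental Image Captioning via Commonsense Knowledge |
|---|---|
| Original language | Chinese (Traditional) |
| Pages (from-to) | 2801-2821 |
| Number of pages | 21 |
| Journal | Jisuanji Xuebao/Chinese Journal of Computers |
| Volume | 48 |
| Issue number | 12 |
| DOI | |
| Publication status | Published - Dec 2025 |
| Externally published | Yes |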
Keywords
- commonsense
- sentimental relationship
- unsupervised sentimental image captioning
- visual sentimental analysis