Abstract
We propose a new task called sentimental visual captioning that generates captions with the inherent sentiment reflected by the input image or video. Compared with the stylized visual captioning task that requires a predefined style independent of the image or video, our new task automatically analyzes the inherent sentiment tendency from the visual content. With this in mind, we propose a multimodal Transformer model namely Senti-Transformer for sentimental visual captioning, which integrates both content and sentiment information from multiple modalities and incorporates prior sentimental knowledge to generate sentimental sentence. Specifically, we extract prior knowledge from sentimental corpus to obtain sentimental textual information and design a multi-head Transformer encoder to encode multimodal features. Then we decompose the attention layer in the middle of Transformer decoder to focus on important features of each modality, and the attended features are integrated through an intra- and inter-modality fusion mechanism for generating sentimental sentences. To effectively train the proposed model using the external sentimental corpus as well as the paired images or videos and factual sentences in existing captioning datasets, we propose a two-stage training strategy that first learns to incorporate sentimental elements into the sentences via a regularization term and then learns to generate fluent and relevant sentences with the inherent sentimental styles via reinforcement learning with a sentimental reward. Extensive experiments on both image and video datasets demonstrate the effectiveness and superiority of our Senti-Transformer on sentimental visual captioning. Source code is available at https://github.com/ezeli/InSentiCap_ext.
| Original language | English |
|---|---|
| Pages (from-to) | 1073-1090 |
| Number of pages | 18 |
| Journal | International Journal of Computer Vision |
| Volume | 131 |
| Issue number | 4 |
| DOIs | |
| Publication status | Published - Apr 2023 |
Keywords
- Sentimental visual captioning
- Transformer
- Visual sentiment analysis
Fingerprint
Dive into the research topics of 'Sentimental Visual Captioning using Multimodal Transformer'. Together they form a unique fingerprint.Cite this
- APA
- Author
- BIBTEX
- Harvard
- Standard
- RIS
- Vancouver