Abstract
In this paper, we propose an approach that infers the labels of unlabeled consumer videos and simultaneously recognizes the key segments of those videos by learning from Web image sets. The key segments are recognized automatically by transferring knowledge learned from related Web image sets to the videos. We introduce an adaptive latent structural SVM method that adapts classifiers pre-learned on Web image sets into an optimal target classifier, where the locations of the key segments are modeled as latent variables because ground-truth key segments are not available. We train the annotation models with a limited number of labeled videos and abundant labeled Web images, which significantly alleviates the time-consuming and labor-intensive collection of large numbers of labeled training videos. Experiments on two challenging datasets, Columbia's Consumer Video (CCV) and TRECVID 2014 Multimedia Event Detection (MED2014), show that our method outperforms state-of-the-art methods.
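The latent-variable formulation described above can be illustrated with a minimal sketch: treating the key-segment index as the latent variable h, a video is scored by its best-scoring segment, f(x) = max_h w·φ(x, h). The function and feature values below are hypothetical toy inputs, not the paper's actual model or features; this only shows the max-over-segments inference step, not the full adaptive training procedure.

```python
import numpy as np

def latent_segment_score(w, segment_features):
    """Score a video by its best-scoring ("key") segment.

    The segment index plays the role of the latent variable h:
        f(x) = max_h  w . phi(x, h)
    Returns the video-level score and the inferred key-segment index.
    """
    scores = segment_features @ w          # one score per segment
    h_star = int(np.argmax(scores))        # inferred key segment
    return float(scores[h_star]), h_star

# Hypothetical example: a video with 4 segments, 3-d features each.
w = np.array([1.0, -0.5, 0.2])
segments = np.array([
    [0.1, 0.9, 0.0],
    [0.8, 0.1, 0.5],   # this segment scores highest under w
    [0.3, 0.4, 0.2],
    [0.0, 0.7, 0.1],
])
score, key_idx = latent_segment_score(w, segments)
label = 1 if score > 0 else -1             # video-level annotation
```

During training, such a step alternates with updating w: the current classifier infers each positive video's key segment, and the classifier is then re-fit treating those segments as fixed, which is the usual latent SVM alternation.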
Original language | English |
---|---|
Pages (from-to) | 6111-6126 |
Number of pages | 16 |
Journal | Multimedia Tools and Applications |
Volume | 76 |
Issue number | 5 |
DOIs | |
Publication status | Published - 1 Mar 2017 |
Keywords
- Image set
- Key segment
- Transfer learning
- Video annotation