Abstract
In this paper, we present a novel approach to extracting key segments for event detection in unconstrained videos. The key segments are extracted automatically by transferring knowledge learned from Web images and Web videos to consumer videos. We propose an adaptive latent structural support vector machine model in which the locations of key segments are treated as latent variables, since ground-truth key-segment locations are unavailable in the training data. To alleviate the time-consuming and labor-intensive manual annotation of large numbers of training videos, we collect a large number of loosely labeled Web images and videos from Web sources; a limited number of labeled consumer videos are additionally used to guarantee the precision of the model. Considering the semantic diversity of key segments, we learn a set of concepts as their semantic description and exploit the temporal information of these concepts to capture sequential relations between segments. The concepts are discovered automatically from Web images and videos together with their associated tags and description sentences. Comprehensive experiments on the Columbia Consumer Video and TRECVID 2014 Multimedia Event Detection datasets demonstrate that our method outperforms state-of-the-art methods.
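As a rough sketch only (the abstract does not give the paper's exact formulation, which additionally adapts to the Web-sourced training data), a latent structural SVM with key-segment locations as latent variables $h$ typically optimizes an objective of the standard form introduced by Yu and Joachims (2009):

$$
\min_{w} \; \frac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{n} \Big[ \max_{y,\,h} \big( \Delta(y_i, y) + w^{\top}\psi(x_i, y, h) \big) - \max_{h} \, w^{\top}\psi(x_i, y_i, h) \Big]
$$

where $x_i$ is a training video, $y_i$ its event label, $\psi$ a joint feature map over a video, a label, and candidate key-segment locations, and $\Delta$ a label loss. Since the latent maximization makes the objective non-convex, such models are usually trained by alternating between imputing the latent segment locations and solving the resulting structural SVM (a CCCP-style procedure).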
Original language | English
---|---
Pages (from-to) | 1088-1100
Number of pages | 13
Journal | IEEE Transactions on Multimedia
Volume | 20
Issue | 5
DOI |
Publication status | Published - May 2018