Abstract
In this paper, we present a novel approach to extracting key segments for event detection in unconstrained videos. The key segments are extracted automatically by transferring knowledge learned from Web images and Web videos to consumer videos. We propose an adaptive latent structural support vector machine model in which the locations of key segments are treated as latent variables, since ground-truth key-segment locations are unavailable in the training data. To alleviate the time-consuming and labor-intensive manual annotation of large numbers of training videos, we collect a large number of loosely labeled Web images and videos from Web sources; a limited number of labeled consumer videos are additionally used to guarantee the precision of the model. Considering the semantic diversity of key segments, we learn a set of concepts as their semantic description and exploit the temporal information of these concepts to capture sequential relations between segments. The concepts are discovered automatically from Web images and videos together with their associated tags and description sentences. Comprehensive experiments on the Columbia Consumer Video and TRECVID 2014 Multimedia Event Detection datasets demonstrate that our method outperforms state-of-the-art methods.
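As a rough sketch only (the abstract does not give the paper's exact formulation, which additionally adapts to the Web-sourced training data), a latent structural SVM with key-segment locations as latent variables $h$ typically optimizes an objective of the standard form introduced by Yu and Joachims (2009):

$$
\min_{w} \; \frac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{n} \Big[ \max_{y,\,h} \big( \Delta(y_i, y) + w^{\top}\psi(x_i, y, h) \big) - \max_{h} \, w^{\top}\psi(x_i, y_i, h) \Big]
$$

where $x_i$ is a training video, $y_i$ its event label, $\psi$ a joint feature map over a video, a label, and candidate key-segment locations, and $\Delta$ a label loss. Since the latent maximization makes the objective non-convex, such models are usually trained by alternating between imputing the latent segment locations and solving the resulting structural SVM (a CCCP-style procedure).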
Original language | English
---|---
Pages (from-to) | 1088-1100
Number of pages | 13
Journal | IEEE Transactions on Multimedia
Volume | 20
Issue | 5
DOI |
Publication status | Published - May 2018