Abstract
In this paper, we present a novel approach to extracting key segments for event detection in unconstrained videos. The key segments are extracted automatically by transferring knowledge learned from Web images and Web videos to consumer videos. We propose an adaptive latent structural support vector machine model in which the locations of key segments are treated as latent variables, since ground-truth key-segment locations are unavailable in the training data. To avoid the time-consuming and labor-intensive manual annotation of large numbers of training videos, we collect a large number of loosely labeled Web images and videos from Web sources. In addition, a small number of labeled consumer videos are used to maintain the precision of the model. To account for the semantic diversity of key segments, we learn a set of concepts as a semantic description of key segments and exploit the temporal information of these concepts to capture the sequential relations between segments. The concepts are discovered automatically from Web images and videos together with their associated tags and description sentences. Comprehensive experiments on the Columbia Consumer Video (CCV) and TRECVID 2014 Multimedia Event Detection datasets demonstrate that our method outperforms state-of-the-art methods.
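As a rough illustration of the latent-variable idea described in the abstract (not the paper's adaptive transfer formulation), the sketch below trains a plain latent structural SVM by alternating between inferring each video's key-segment location and taking a sub-gradient step on the hinge loss. The function names (`infer_latent`, `train_latent_svm`), the per-segment feature vectors, and all hyperparameters are hypothetical simplifications for exposition.

```python
import numpy as np

# Minimal sketch of a latent structural SVM, assuming each video is a
# matrix of per-segment feature vectors and the key-segment index h is
# the latent variable. The paper's adaptive/transfer terms are omitted.

def infer_latent(w, segments):
    """Latent completion: pick the segment that scores highest under w."""
    return int(np.argmax(segments @ w))

def train_latent_svm(videos, labels, dim, C=1.0, lr=1e-3, epochs=20):
    """videos: list of (n_segments, dim) arrays; labels in {-1, +1}."""
    w = np.zeros(dim)
    for _ in range(epochs):
        for segs, y in zip(videos, labels):
            h = infer_latent(w, segs)        # fix the latent variable
            x = segs[h]
            grad = w / (C * len(videos))     # regularizer sub-gradient
            if y * (w @ x) < 1:              # hinge loss is active
                grad -= y * x
            w -= lr * grad                   # sub-gradient step on w
    return w

# Toy usage with random segment features.
rng = np.random.default_rng(0)
videos = [rng.normal(size=(5, 8)) for _ in range(10)]
labels = [1 if i % 2 == 0 else -1 for i in range(10)]
w = train_latent_svm(videos, labels, dim=8)
```

The alternation mirrors the standard latent SVM training scheme: with w fixed, the best segment location is inferred per video; with the latent locations fixed, the objective reduces to an ordinary structural SVM update.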
| Original language | English |
| --- | --- |
| Pages (from-to) | 1088-1100 |
| Number of pages | 13 |
| Journal | IEEE Transactions on Multimedia |
| Volume | 20 |
| Issue number | 5 |
| DOIs | |
| Publication status | Published - May 2018 |
Keywords
- Event detection
- automatic concept discovery
- key segments
- transfer learning