TY - GEN
T1 - Rebuilding visual vocabulary via spatial-temporal context similarity for video retrieval
AU - Wang, Lei
AU - Elyan, Eyad
AU - Song, Dawei
PY - 2014
Y1 - 2014
N2 - The Bag-of-visual-Words (BovW) model is one of the most popular visual content representation methods for large-scale content-based video retrieval. Visual words are quantized according to a visual vocabulary generated by clustering visual features (e.g. K-means or GMM). In principle, two types of errors can occur in the quantization process, referred to as the UnderQuantize and OverQuantize problems. The former causes ambiguities and often leads to false visual content matches, while the latter generates synonyms and may lead to missed true matches. Unlike most state-of-the-art research, which concentrates on enhancing the BovW model by disambiguating the visual words, in this paper we aim to address the OverQuantize problem by incorporating the similarity of the spatial-temporal contexts associated with pairs of visual words. Visual words with similar contexts and appearance are assumed to be synonyms. These synonyms in the initial visual vocabulary are then merged to rebuild a more compact and descriptive vocabulary. Our approach was evaluated on the TRECVID2002 and CC-WEB-VIDEO datasets for two typical Query-By-Example (QBE) video retrieval applications. Experimental results demonstrate substantial improvements in retrieval performance over the initial visual vocabulary generated by the BovW model. We also show that our approach can be combined with a state-of-the-art disambiguation method to further improve QBE video retrieval performance.
KW - Bag-of-visual-Words
KW - Content-based Video Retrieval
KW - Spatial-Temporal Context
KW - Synonyms
KW - Visual Vocabulary
UR - http://www.scopus.com/inward/record.url?scp=84893443286&partnerID=8YFLogxK
U2 - 10.1007/978-3-319-04114-8_7
DO - 10.1007/978-3-319-04114-8_7
M3 - Conference contribution
AN - SCOPUS:84893443286
SN - 9783319041131
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 74
EP - 85
BT - MultiMedia Modeling - 20th Anniversary International Conference, MMM 2014, Proceedings
T2 - 20th Anniversary International Conference on MultiMedia Modeling, MMM 2014
Y2 - 6 January 2014 through 10 January 2014
ER -