TY - GEN
T1 - Rebuilding visual vocabulary via spatial-temporal context similarity for video retrieval
AU - Wang, Lei
AU - Elyan, Eyad
AU - Song, Dawei
PY - 2014
Y1 - 2014
N2 - The Bag-of-visual-Words (BovW) model is one of the most popular visual content representation methods for large-scale content-based video retrieval. Visual words are quantized according to a visual vocabulary generated by clustering visual features (e.g. K-means or GMM). In principle, two types of errors can occur in the quantization process, referred to as the UnderQuantize and OverQuantize problems. The former causes ambiguities and often leads to false visual content matches, while the latter generates synonyms and may lead to missed true matches. Unlike most state-of-the-art research, which concentrates on enhancing the BovW model by disambiguating the visual words, in this paper we aim to address the OverQuantize problem by incorporating the similarity of the spatial-temporal contexts associated with pairs of visual words. Visual words with similar contexts and appearance are assumed to be synonyms. These synonyms in the initial visual vocabulary are then merged to rebuild a more compact and descriptive vocabulary. Our approach was evaluated on the TRECVID2002 and CC-WEB-VIDEO datasets for two typical Query-By-Example (QBE) video retrieval applications. Experimental results demonstrate substantial improvements in retrieval performance over the initial visual vocabulary generated by the BovW model. We also show that our approach can be combined with a state-of-the-art disambiguation method to further improve QBE video retrieval performance.
KW - Bag-of-visual-Words
KW - Content-based Video Retrieval
KW - Spatial-Temporal Context
KW - Synonyms
KW - Visual Vocabulary
UR - http://www.scopus.com/inward/record.url?scp=84893443286&partnerID=8YFLogxK
U2 - 10.1007/978-3-319-04114-8_7
DO - 10.1007/978-3-319-04114-8_7
M3 - Conference contribution
AN - SCOPUS:84893443286
SN - 9783319041131
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 74
EP - 85
BT - MultiMedia Modeling - 20th Anniversary International Conference, MMM 2014, Proceedings
T2 - 20th Anniversary International Conference on MultiMedia Modeling, MMM 2014
Y2 - 6 January 2014 through 10 January 2014
ER -