The study on detecting near-duplicate WebPages

Yu Juan Cao*, Zhen Dong Niu, Wei Qiang Wang, Kun Zhao

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

3 Citations (Scopus)

Abstract

Reprinting information among websites produces a great deal of redundant WebPages. To improve search efficiency and user satisfaction, an algorithm to Detect near-Duplicate WebPages (DDW) is proposed. In the course of developing a near-duplicate detection system for a multi-billion page repository, we make two research contributions. First, we consider both syntactic and semantic information to represent documents and compute their similarities. Second, after classifying web pages into different categories, we index the features in each category and then search for near-duplicates only within the same category. From Google search results for 72 queries, we manually select 5835 near-duplicate WebPages and insert them into an existing collection of about 768,763 WebPages to form the test data. The experimental results demonstrate that our approach outperforms I-Match algorithms. In a large-scale test, approximately linear time and space complexity are achieved.
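The second contribution, restricting the near-duplicate search to pages in the same category, can be illustrated with a minimal sketch. This is not the DDW algorithm itself, whose details the abstract does not give. It assumes the category label comes from a classifier that is not shown, and it substitutes plain word shingles with Jaccard similarity for the paper's combined syntactic and semantic features. The names `shingles`, `jaccard`, `CategoryIndex`, and the threshold value are hypothetical.

```python
from collections import defaultdict


def shingles(text, k=3):
    """Return the set of k-word shingles of a text (a purely syntactic feature)."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


class CategoryIndex:
    """Hypothetical per-category index: a query is compared only with
    pages stored under the same category label."""

    def __init__(self, threshold=0.6):
        self.threshold = threshold  # assumed cut-off, not taken from the paper
        self.by_category = defaultdict(list)  # category -> [(doc_id, features)]

    def add(self, doc_id, category, text):
        self.by_category[category].append((doc_id, shingles(text)))

    def near_duplicates(self, category, text):
        query = shingles(text)
        candidates = self.by_category.get(category, [])
        return [doc_id for doc_id, feats in candidates
                if jaccard(query, feats) >= self.threshold]


index = CategoryIndex(threshold=0.5)
index.add("a", "news", "the quick brown fox jumps over the lazy dog today")
index.add("b", "sports", "the quick brown fox jumps over the lazy dog today")
index.add("c", "news", "completely unrelated text about stock markets and bonds")

# Only pages filed under "news" are compared with the query.
hits = index.near_duplicates(
    "news", "the quick brown fox jumps over the lazy dog yesterday")
```

Page "b" has the same text as page "a" but is never compared with the query, because it is filed under a different category. The sketch shows the trade-off of this design: each query touches only one partition, but a page assigned to the wrong category cannot be found as a near-duplicate.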

Original language: English
Title of host publication: Proceedings - 2008 IEEE 8th International Conference on Computer and Information Technology, CIT 2008
Pages: 95-100
Number of pages: 6
Publication status: Published - 2008
Event: 2008 IEEE 8th International Conference on Computer and Information Technology, CIT 2008 - Sydney, NSW, Australia
Duration: 8 Jul 2008 → 11 Jul 2008

Publication series

Name: Proceedings - 2008 IEEE 8th International Conference on Computer and Information Technology, CIT 2008

Conference

Conference: 2008 IEEE 8th International Conference on Computer and Information Technology, CIT 2008
Country/Territory: Australia
City: Sydney, NSW
Period: 8/07/08 → 11/07/08
