TY - GEN
T1 - The study on detecting near-duplicate WebPages
AU - Cao, Yu Juan
AU - Niu, Zhen Dong
AU - Wang, Wei Qiang
AU - Zhao, Kun
PY - 2008
Y1 - 2008
N2 - Reprinting information across websites produces a large number of redundant web pages. To improve search efficiency and user satisfaction, an algorithm to Detect near-Duplicate WebPages (DDW) is proposed. In the course of developing a near-duplicate detection system for a multi-billion page repository, we make two research contributions. First, we consider both syntactic and semantic information to represent documents and compute their similarities. Second, after classifying web pages into different categories, we index the features in each category and then search for near-duplicates only within the same category. From Google search results for 72 queries, we manually select 5835 near-duplicate web pages and insert them into an existing collection of about 768,763 web pages to form the test data. The experimental results demonstrate that our approach outperforms I-Match algorithms. In a large-scale test, approximately linear time and space complexity is achieved.
AB - Reprinting information across websites produces a large number of redundant web pages. To improve search efficiency and user satisfaction, an algorithm to Detect near-Duplicate WebPages (DDW) is proposed. In the course of developing a near-duplicate detection system for a multi-billion page repository, we make two research contributions. First, we consider both syntactic and semantic information to represent documents and compute their similarities. Second, after classifying web pages into different categories, we index the features in each category and then search for near-duplicates only within the same category. From Google search results for 72 queries, we manually select 5835 near-duplicate web pages and insert them into an existing collection of about 768,763 web pages to form the test data. The experimental results demonstrate that our approach outperforms I-Match algorithms. In a large-scale test, approximately linear time and space complexity is achieved.
UR - http://www.scopus.com/inward/record.url?scp=51849093538&partnerID=8YFLogxK
U2 - 10.1109/CIT.2008.4594656
DO - 10.1109/CIT.2008.4594656
M3 - Conference contribution
AN - SCOPUS:51849093538
SN - 9781424423583
T3 - Proceedings - 2008 IEEE 8th International Conference on Computer and Information Technology, CIT 2008
SP - 95
EP - 100
BT - Proceedings - 2008 IEEE 8th International Conference on Computer and Information Technology, CIT 2008
T2 - 2008 IEEE 8th International Conference on Computer and Information Technology, CIT 2008
Y2 - 8 July 2008 through 11 July 2008
ER -