SizeSpotSigs: An effective deduplicate algorithm considering the size of page content

Xianling Mao*, Xiaobing Liu, Nan Di, Xiaoming Li, Hongfei Yan

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; peer-review

4 Citations (Scopus)

Abstract

Detecting whether two Web pages are near replicas, in terms of their contents rather than their files, is of great importance in many Web information based applications. As a result, many deduplication algorithms have been proposed. Nevertheless, analysis and experiments show that existing algorithms usually do not work well for short Web pages, because the corresponding files contain a relatively large portion of noisy information, such as ads and website templates. In this paper, we analyze the critical issues in deduplicating short Web pages and present an algorithm (AF-SpotSigs) that addresses them and can perform 15% better than the state-of-the-art method. We then propose an algorithm (SizeSpotSigs) that takes the size of page contents into account and can handle both short and long Web pages. The contributions of this work are three-fold: 1) we provide an analysis of the relation between noise-content ratio and similarity, and propose two rules for making the methods work better; 2) based on the analysis, we propose 3 new features for Chinese that improve effectiveness on short Web pages; 3) we present an algorithm named SizeSpotSigs for near-duplicate detection that considers the size of the core content in a Web page. Experiments confirm that SizeSpotSigs works better than state-of-the-art approaches such as SpotSigs over a demonstrative Mixer of manually assessed near-duplicate news articles, which includes both short and long Web pages.
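
The effect of noise on short pages described in the abstract can be illustrated with a minimal sketch. It is not the SizeSpotSigs or AF-SpotSigs algorithm: it assumes a plain set-based Jaccard similarity, and the helper `page_signatures`, the token names, and the page sizes are all hypothetical values chosen for illustration. Two pages share the same core content but carry different site noise (ads, templates), and the same amount of noise lowers the similarity far more when the core content is short.

```python
def jaccard(a, b):
    """Jaccard similarity of two signature sets (assumed measure, for illustration)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def page_signatures(core_size, noise_size, noise_tag):
    """Hypothetical page: shared core signatures plus site-specific noise signatures."""
    core = {f"core-{i}" for i in range(core_size)}
    noise = {f"{noise_tag}-{i}" for i in range(noise_size)}
    return core | noise


# Same article mirrored on two sites, each adding 20 noise signatures.
short_a = page_signatures(10, 20, "siteA")
short_b = page_signatures(10, 20, "siteB")
long_a = page_signatures(200, 20, "siteA")
long_b = page_signatures(200, 20, "siteB")

short_sim = jaccard(short_a, short_b)  # 10 shared / 50 total
long_sim = jaccard(long_a, long_b)     # 200 shared / 240 total
print(f"short pages: {short_sim:.3f}, long pages: {long_sim:.3f}")
```

Under these assumptions a single fixed similarity threshold would accept the long pair and reject the short pair, even though both pairs have identical core content.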

Original language: English
Title of host publication: Advances in Knowledge Discovery and Data Mining - 15th Pacific-Asia Conference, PAKDD 2011, Proceedings
Publisher: Springer Verlag
Pages: 537-548
Number of pages: 12
Edition: PART 1
ISBN (Print): 9783642208409
DOIs
Publication status: Published - 2011
Externally published: Yes

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Number: PART 1
Volume: 6634 LNAI
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Keywords

  • AF-SpotSigs
  • Deduplicate
  • Information Retrieval
  • Near Duplicate Detection
  • SizeSpotSigs
