Near duplicated web pages detection based on concept and semantic network

Yu Juan Cao*, Zhen Dong Niu, Kun Zhao, Xue Ping Peng

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

7 Citations (Scopus)

Abstract

Reprinting across websites and blogs produces a great deal of redundant web pages. To improve search efficiency and user satisfaction, near-Duplicate Web page Detection based on Concept and Semantic network (DWDCS) is proposed. In the course of developing a near-duplicate detection system for a multi-billion-page repository, this paper makes two research contributions. First, key concepts, rather than keyphrases, are extracted to build a Small World Network (SWN). This not only reduces the complexity of the semantic network but also resolves the "expression difference" problem. Second, both syntactic and semantic information are used to represent and compute document similarity. In a large-scale test, experimental results demonstrate that this approach outperforms both I-Match and keyphrase extraction algorithms based on SWN. Advantages such as linear time and space complexity and no reliance on a corpus make the algorithm valuable in practice.
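The abstract describes the pipeline only at a high level: extract key concepts, build a small-world semantic network, and combine syntactic and semantic information into a document similarity. The paper's actual extractor and network construction are not given here, so the following is a minimal sketch under stated assumptions: frequent terms stand in for extracted concepts, and plain cosine similarity over concept frequencies stands in for the combined syntactic/semantic measure.

```python
import math
import re
from collections import Counter

def extract_concepts(text, top_k=10):
    """Stand-in for key-concept extraction: take the top_k most frequent
    terms. DWDCS instead maps terms to concepts via a small-world
    semantic network (details are not in the abstract)."""
    words = re.findall(r"[a-z]+", text.lower())
    return dict(Counter(words).most_common(top_k))

def similarity(doc_a, doc_b):
    """Cosine similarity over concept-frequency vectors (an assumption;
    the paper combines syntactic and semantic similarity)."""
    ca, cb = extract_concepts(doc_a), extract_concepts(doc_b)
    dot = sum(ca[w] * cb[w] for w in set(ca) & set(cb))
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def near_duplicate(doc_a, doc_b, threshold=0.9):
    """Flag two pages as near-duplicates when their concept-vector
    similarity exceeds a threshold (threshold value is illustrative)."""
    return similarity(doc_a, doc_b) >= threshold
```

Both steps run in time linear in document length, which is consistent with the linear time and space complexity the abstract claims, though the real algorithm's constants and data structures will differ.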

Original language: English
Pages (from-to): 1816-1826
Number of pages: 11
Journal: Ruan Jian Xue Bao/Journal of Software
Volume: 22
Issue number: 8
DOIs
Publication status: Published - Aug 2011

Keywords

  • Duplicate removal algorithm
  • Near duplicated Web page
  • Small world network
  • Standard deviation

