Near duplicated web pages detection based on concept and semantic network

Yu Juan Cao*, Zhen Dong Niu, Kun Zhao, Xue Ping Peng

*Corresponding author for this work

Research output: Contribution to journal, Article, peer-reviewed

7 Citations (Scopus)

Abstract

Reprinting by websites and blogs produces a great many redundant web pages. To improve search efficiency and user satisfaction, near-Duplicate WebPages Detection based on Concept and Semantic network (DWDCS) is proposed. In the course of developing a near-duplicate detection system for a multi-billion-page repository, this paper makes two research contributions. First, key concepts are extracted, instead of keyphrases, to build a Small Word Network (SWN). This not only reduces the complexity of the semantic network, but also resolves the "expression difference" problem. Second, this paper considers both syntactic and semantic information to represent documents and compute their similarities. In a large-scale test, experimental results demonstrate that this approach outperforms both I-Match and keyphrase extraction algorithms based on SWN. Advantages such as linear time and space complexity, and no need for a corpus, make the algorithm valuable in practice.

Original language: English
Pages (from-to): 1816-1826
Number of pages: 11
Journal: Ruan Jian Xue Bao/Journal of Software
Volume: 22
Issue number: 8
Publication status: Published - Aug 2011
