TY - JOUR
T1 - Near duplicated web pages detection based on concept and semantic network
AU - Cao, Yu Juan
AU - Niu, Zhen Dong
AU - Zhao, Kun
AU - Peng, Xue Ping
PY - 2011/8
Y1 - 2011/8
N2 - Reprinting across websites and blogs produces a great deal of redundant Web pages. To improve search efficiency and user satisfaction, near-Duplicate Web page Detection based on Concept and Semantic network (DWDCS) is proposed. In the course of developing a near-duplicate detection system for a multi-billion-page repository, this paper makes two research contributions. First, key concepts, rather than keyphrases, are extracted to build a Small World Network (SWN). This not only reduces the complexity of the semantic network, but also resolves the "expression difference" problem. Second, both syntactic and semantic information are used to represent documents and compute their similarities. Experimental results on a large-scale test demonstrate that this approach outperforms both I-Match and keyphrase-extraction algorithms based on SWN. Advantages such as linear time and space complexity and no reliance on a corpus make the algorithm valuable in practice.
AB - Reprinting across websites and blogs produces a great deal of redundant Web pages. To improve search efficiency and user satisfaction, near-Duplicate Web page Detection based on Concept and Semantic network (DWDCS) is proposed. In the course of developing a near-duplicate detection system for a multi-billion-page repository, this paper makes two research contributions. First, key concepts, rather than keyphrases, are extracted to build a Small World Network (SWN). This not only reduces the complexity of the semantic network, but also resolves the "expression difference" problem. Second, both syntactic and semantic information are used to represent documents and compute their similarities. Experimental results on a large-scale test demonstrate that this approach outperforms both I-Match and keyphrase-extraction algorithms based on SWN. Advantages such as linear time and space complexity and no reliance on a corpus make the algorithm valuable in practice.
KW - Duplicate removal algorithm
KW - Near duplicated Web page
KW - Small world network
KW - Standard deviation
UR - http://www.scopus.com/inward/record.url?scp=80052073309&partnerID=8YFLogxK
U2 - 10.3724/SP.J.1001.2011.03890
DO - 10.3724/SP.J.1001.2011.03890
M3 - Article
AN - SCOPUS:80052073309
SN - 1000-9825
VL - 22
SP - 1816
EP - 1826
JO - Ruan Jian Xue Bao/Journal of Software
JF - Ruan Jian Xue Bao/Journal of Software
IS - 8
ER -