TY - JOUR
T1 - Near duplicated web pages detection based on concept and semantic network
AU - Cao, Yu Juan
AU - Niu, Zhen Dong
AU - Zhao, Kun
AU - Peng, Xue Ping
PY - 2011/8
Y1 - 2011/8
N2 - Reprinting across websites and blogs produces a great deal of redundant Web pages. To improve search efficiency and user satisfaction, near-Duplicate Web page Detection based on Concept and Semantic network (DWDCS) is proposed. In the course of developing a near-duplicate detection system for a multi-billion-page repository, this paper makes two research contributions. First, key concepts, rather than keyphrases, are extracted to build a Small World Network (SWN). This not only reduces the complexity of the semantic network, but also resolves the "expression difference" problem. Second, both syntactic and semantic information are used to represent documents and compute their similarities. Experimental results on a large-scale test demonstrate that this approach outperforms both I-Match and keyphrase-extraction algorithms based on SWN. Advantages such as linear time and space complexity and no reliance on a corpus make the algorithm valuable in practice.
AB - Reprinting across websites and blogs produces a great deal of redundant Web pages. To improve search efficiency and user satisfaction, near-Duplicate Web page Detection based on Concept and Semantic network (DWDCS) is proposed. In the course of developing a near-duplicate detection system for a multi-billion-page repository, this paper makes two research contributions. First, key concepts, rather than keyphrases, are extracted to build a Small World Network (SWN). This not only reduces the complexity of the semantic network, but also resolves the "expression difference" problem. Second, both syntactic and semantic information are used to represent documents and compute their similarities. Experimental results on a large-scale test demonstrate that this approach outperforms both I-Match and keyphrase-extraction algorithms based on SWN. Advantages such as linear time and space complexity and no reliance on a corpus make the algorithm valuable in practice.
KW - Duplicate removal algorithm
KW - Near duplicated Web page
KW - Small world network
KW - Standard deviation
UR - http://www.scopus.com/inward/record.url?scp=80052073309&partnerID=8YFLogxK
U2 - 10.3724/SP.J.1001.2011.03890
DO - 10.3724/SP.J.1001.2011.03890
M3 - Article
AN - SCOPUS:80052073309
SN - 1000-9825
VL - 22
SP - 1816
EP - 1826
JO - Ruan Jian Xue Bao/Journal of Software
JF - Ruan Jian Xue Bao/Journal of Software
IS - 8
ER -