TY - JOUR
T1 - A survey of Web page cleaning research
AU - Mao, Xianling
AU - He, Jing
AU - Yan, Hongfei
PY - 2010/12
Y1 - 2010/12
N2 - The rapid development of the Internet has produced a wide variety of Web applications and Web data, which have become a major data source for much research. A Web page contains many kinds of content, such as advertisements, navigation bars, related links, and body text. However, not all of this content is needed for a given study or application; on the contrary, unrelated content can degrade the effectiveness and efficiency of that research or application. With the rapid growth of search engines, Web page cleaning has therefore become a prominent topic in information retrieval, and a summary of the work on page de-noising is needed to support further in-depth study. First, this paper briefly introduces the necessity of Web page cleaning and its related concepts. Second, the authors present a classification hierarchy of Web page cleaning methods, comprising single-model based and multi-model based methods, and summarize the main Web page cleaning techniques and frameworks, including SST, Shingle, Pagelet, and DSE, among others. Third, the paper describes the experimental datasets and experimental methods used with these techniques. Finally, it discusses existing problems and future directions in the Web page cleaning field.
AB - The rapid development of the Internet has produced a wide variety of Web applications and Web data, which have become a major data source for much research. A Web page contains many kinds of content, such as advertisements, navigation bars, related links, and body text. However, not all of this content is needed for a given study or application; on the contrary, unrelated content can degrade the effectiveness and efficiency of that research or application. With the rapid growth of search engines, Web page cleaning has therefore become a prominent topic in information retrieval, and a summary of the work on page de-noising is needed to support further in-depth study. First, this paper briefly introduces the necessity of Web page cleaning and its related concepts. Second, the authors present a classification hierarchy of Web page cleaning methods, comprising single-model based and multi-model based methods, and summarize the main Web page cleaning techniques and frameworks, including SST, Shingle, Pagelet, and DSE, among others. Third, the paper describes the experimental datasets and experimental methods used with these techniques. Finally, it discusses existing problems and future directions in the Web page cleaning field.
KW - Data mining
KW - Information retrieval
KW - WWW
KW - Web mining
KW - Web page cleaning
UR - http://www.scopus.com/inward/record.url?scp=78650954485&partnerID=8YFLogxK
M3 - Article
AN - SCOPUS:78650954485
SN - 1000-1239
VL - 47
SP - 2025
EP - 2036
JO - Jisuanji Yanjiu yu Fazhan/Computer Research and Development
JF - Jisuanji Yanjiu yu Fazhan/Computer Research and Development
IS - 12
ER -