A survey of Web page cleaning research

Xianling Mao*, Jing He, Hongfei Yan

*Corresponding author for this work

Research output: Contribution to journal, Article, peer-reviewed

7 citations (Scopus)

Abstract

The rapid development of the Internet has produced a wide variety of Web applications and Web data, which have become a major data source for much research. A Web page contains many kinds of content, such as advertisements, navigation bars, related links, and body text. For a given study or application, however, not all of this content is needed; on the contrary, the irrelevant content degrades both effectiveness and efficiency. With the growth of search engines, Web page cleaning has therefore become a prominent topic in information retrieval, and a summary of work on page de-noising is needed to support more in-depth study. This paper first briefly introduces the necessity of Web page cleaning and its related concepts. It then presents a classification hierarchy of Web page cleaning methods, divided into single-model based methods and multi-model based methods, and summarizes the main Web page cleaning techniques and frameworks, including SST, Shingle, Pagelet, and DSE, among others. Next, it describes the experimental datasets and experimental methods used with these techniques. Finally, it discusses existing problems and future directions in the Web page cleaning field.

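Original language: English
Pages (from-to): 2025-2036
Number of pages: 12
Journal: Jisuanji Yanjiu yu Fazhan/Computer Research and Development
Volume: 47
Issue: 12
Publication status: Published - Dec 2010
Externally published: Yes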
