A survey of Web page cleaning research

Xianling Mao*, Jing He, Hongfei Yan

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

7 Citations (Scopus)

Abstract

The rapid development of the Internet has produced a wide variety of Web applications and Web data, which have become a major source of data for much research. A Web page contains many kinds of content, such as advertisements, navigation bars, related links, and body text. For a given study or application, however, not all of this content is necessary; on the contrary, unrelated content can reduce the effectiveness and efficiency of the research or application. With the rapid growth of search engines, Web page cleaning has therefore become a prominent topic in information retrieval, and a summary of the work on page de-noising is needed to support further in-depth study. First, this paper gives a brief introduction to the necessity of Web page cleaning and its related concepts. The authors then present a classification hierarchy of Web page cleaning methods, comprising single-model based methods and multi-model based methods. Next, the paper summarizes the various Web page cleaning techniques and frameworks, such as SST, Shingle, Pagelet, and DSE. It then describes the experimental datasets and experimental methods used by these techniques. Finally, it discusses the existing problems and future directions in the Web page cleaning field.

Original language: English
Pages (from-to): 2025-2036
Number of pages: 12
Journal: Jisuanji Yanjiu yu Fazhan/Computer Research and Development
Volume: 47
Issue number: 12
Publication status: Published - Dec 2010
Externally published: Yes

Keywords

  • Data mining
  • Information retrieval
  • WWW
  • Web mining
  • Web page cleaning
