Extraction of informative blocks from web pages

Yu Juan Cao*, Zhen Dong Niu, Liu Ling Dai, Yu Ming Zhao

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; peer-review

12 Citations (Scopus)

Abstract

Web pages typically contain a large amount of banner ads, navigation bars, copyright notices, and similar material. Such information is not part of the main content of the page and can seriously harm Web mining and searching. In this paper, we develop and evaluate a method that uses both visual features and semantic information to extract informative blocks. First, we partition a web page into semantic blocks using vision-based page segmentation. The visual features of each block, together with the semantic information obtained by LSI (Latent Semantic Indexing), form the block's feature vector. Second, we manually label the blocks as informative or uninformative. The labeled blocks serve as the training dataset for a classification model, which is then used to extract the informative blocks. Our experiments show that the proposed EIBA (Extract Informative Block Arithmetic) is able to dramatically improve the results of near-duplicate detection and classification tasks.
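
The pipeline described in the abstract (segment, build a feature vector from LSI coordinates plus visual features, train on labeled blocks, classify) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes the page has already been segmented into blocks, approximates LSI with a small power-iteration SVD, and substitutes a simple perceptron for the paper's classification model (the keywords suggest an SVM). The block fields `rel_area`, `link_density`, and `rel_y`, and all function names, are hypothetical choices made for illustration.

```python
import math
from collections import Counter

def term_counts(text, vocab):
    """Count vector of a block's text over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [float(counts[w]) for w in vocab]

def lsi_directions(A, k, iters=200):
    """Approximate the top-k right singular vectors of A (blocks x terms)
    by power iteration on A^T A with deflation. Assumes rank(A) >= k."""
    n = len(A[0])
    dirs = []
    for c in range(k):
        v = [1.0 / (j + c + 1) for j in range(n)]  # deterministic start vector
        for _ in range(iters):
            Av = [sum(a * x for a, x in zip(row, v)) for row in A]
            w = [sum(A[i][j] * Av[i] for i in range(len(A))) for j in range(n)]
            for d in dirs:  # deflate: remove directions already found
                dot = sum(x * y for x, y in zip(w, d))
                w = [x - dot * y for x, y in zip(w, d)]
            norm = math.sqrt(sum(x * x for x in w))
            v = [x / norm for x in w]
        dirs.append(v)
    return dirs

def block_features(block, vocab, dirs):
    """Concatenate LSI coordinates with the (hypothetical) visual features."""
    counts = term_counts(block["text"], vocab)
    lsi = [sum(c * d for c, d in zip(counts, direction)) for direction in dirs]
    return lsi + [block["rel_area"], block["link_density"], block["rel_y"]]

def predict(w, b, x):
    """Return 1 (informative) if the linear score is positive, else 0."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

def train_perceptron(X, y, epochs=500):
    """Perceptron stand-in for the classifier; stops once an epoch is error-free."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, label in zip(X, y):
            err = label - predict(w, b, x)
            if err:
                w = [wi + err * xi for wi, xi in zip(w, x)]
                b += err
                mistakes += 1
        if mistakes == 0:
            break
    return w, b

# Toy, hand-labeled blocks (1 = informative, 0 = uninformative); all values invented.
blocks = [
    {"text": "latent semantic indexing extracts topics",
     "rel_area": 0.50, "link_density": 0.05, "rel_y": 0.40, "label": 1},
    {"text": "semantic blocks carry main content",
     "rel_area": 0.40, "link_density": 0.10, "rel_y": 0.50, "label": 1},
    {"text": "home about contact login sitemap",
     "rel_area": 0.05, "link_density": 0.90, "rel_y": 0.00, "label": 0},
    {"text": "copyright login home terms",
     "rel_area": 0.03, "link_density": 0.80, "rel_y": 0.95, "label": 0},
]

vocab = sorted({w for b in blocks for w in b["text"].lower().split()})
A = [term_counts(b["text"], vocab) for b in blocks]
dirs = lsi_directions(A, k=2)
X = [block_features(b, vocab, dirs) for b in blocks]
y = [b["label"] for b in blocks]
w, bias = train_perceptron(X, y)
predicted = [predict(w, bias, x) for x in X]
```

In a realistic setting the term matrix would be weighted (for example with TF-IDF), the decomposition would come from a numerical library, and the perceptron would be replaced by a margin-based classifier such as an SVM; the structure of the feature vector would stay the same.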

Original language: English
Title of host publication: Proceedings - ALPIT 2008, 7th International Conference on Advanced Language Processing and Web Information Technology
Pages: 544-549
Number of pages: 6
Publication status: Published - 2008
Event: ALPIT 2008, 7th International Conference on Advanced Language Processing and Web Information Technology - Liaoning, China
Duration: 23 Jul 2008 to 25 Jul 2008

Publication series

Name: Proceedings - ALPIT 2008, 7th International Conference on Advanced Language Processing and Web Information Technology

Conference

Conference: ALPIT 2008, 7th International Conference on Advanced Language Processing and Web Information Technology
Country/Territory: China
City: Liaoning
Period: 23/07/08 to 25/07/08

Keywords

  • Data mining
  • Information extraction
  • LSI
  • SVM
  • VIPS
  • Web
  • Web page segmentation
