Extraction of informative blocks from web pages

Yu Juan Cao*, Zhen Dong Niu, Liu Ling Dai, Yu Ming Zhao

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-reviewed

12 Citations (Scopus)

Abstract

Web pages typically contain a large amount of material such as banner ads, navigation bars, and copyright notices. Such irrelevant information is not part of the main content of a page and can seriously harm Web mining and search. In this paper, we develop and evaluate a method that uses both visual features and semantic information to extract informative blocks. First, we partition a web page into semantic blocks using vision-based page segmentation, and combine each block's visual features with semantic information obtained through LSI (Latent Semantic Indexing) to form the block's feature vector. Second, we manually label the blocks as informative or uninformative and use the labeled blocks as a training dataset for a classification model. The trained model is then used to extract the informative blocks. Our experiments show that the proposed EIBA (Extract Informative Block Arithmetic) dramatically improves results in near-duplicate detection and classification tasks.
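The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the vision-based segmentation step is skipped (blocks are assumed to be already segmented), the two visual features (relative area and link density) are hypothetical, LSI is approximated by power iteration on a tiny block-by-term matrix, and a nearest-centroid rule stands in for the classification model, which the abstract does not name.

```python
import math
import random
from collections import Counter

def tokenize(text):
    return text.lower().split()

def fit_lsi(texts, k=2, iters=300):
    """Top-k right singular vectors of the block-by-term count matrix,
    found by power iteration with deflation (fine for toy sizes only)."""
    vocab = sorted({w for t in texts for w in tokenize(t)})
    counts = [Counter(tokenize(t)) for t in texts]
    A = [[c[w] for w in vocab] for c in counts]
    n = len(A)
    # Gram matrix G = A A^T; its eigenpairs give left singular vectors and s^2.
    G = [[float(sum(a * b for a, b in zip(A[i], A[j]))) for j in range(n)]
         for i in range(n)]
    rng = random.Random(0)
    components = []
    for _ in range(k):
        u = [rng.random() for _ in range(n)]
        for _ in range(iters):
            w = [sum(G[i][j] * u[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break
            u = [x / norm for x in w]
        lam = sum(u[i] * G[i][j] * u[j] for i in range(n) for j in range(n))
        if lam < 1e-9:
            break
        s = math.sqrt(lam)
        # Right singular vector v = A^T u / s lives in term space.
        components.append([sum(A[i][t] * u[i] for i in range(n)) / s
                           for t in range(len(vocab))])
        # Deflate so the next pass finds the next-largest eigenpair.
        for i in range(n):
            for j in range(n):
                G[i][j] -= lam * u[i] * u[j]
    return vocab, components

def project(text, vocab, components):
    """Fold a block's text into the latent space (terms outside vocab are ignored)."""
    c = Counter(tokenize(text))
    return [sum(c[w] * v[t] for t, w in enumerate(vocab)) for v in components]

def feature_vector(block, vocab, components):
    # Hypothetical visual features first, then the LSI coordinates.
    return list(block["visual"]) + project(block["text"], vocab, components)

def train_centroids(vectors, labels):
    sums, counts = {}, Counter(labels)
    for vec, lab in zip(vectors, labels):
        acc = sums.setdefault(lab, [0.0] * len(vec))
        for i, x in enumerate(vec):
            acc[i] += x
    return {lab: [x / counts[lab] for x in acc] for lab, acc in sums.items()}

def classify(vec, centroids):
    return min(centroids, key=lambda lab: math.dist(vec, centroids[lab]))

# Toy, manually labeled blocks; "visual" = (relative area, link density), both assumed.
train = [
    {"text": "home news sports contact login", "visual": (0.05, 0.9), "label": "uninformative"},
    {"text": "copyright all rights reserved contact", "visual": (0.03, 0.6), "label": "uninformative"},
    {"text": "researchers propose a method to extract main content from web pages",
     "visual": (0.55, 0.05), "label": "informative"},
    {"text": "the method segments pages into blocks and classifies each block",
     "visual": (0.40, 0.02), "label": "informative"},
]
new_blocks = [
    {"text": "home contact login register", "visual": (0.04, 0.8)},
    {"text": "the method segments web pages into content blocks", "visual": (0.5, 0.03)},
]

vocab, components = fit_lsi([b["text"] for b in train], k=2)
centroids = train_centroids([feature_vector(b, vocab, components) for b in train],
                            [b["label"] for b in train])
predictions = [classify(feature_vector(b, vocab, components), centroids) for b in new_blocks]
print(predictions)
```

In practice the visual and latent features would need to be put on comparable scales before training, and a library SVD and a stronger classifier would replace the toy components used here.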

Original language: English
Title of host publication: Proceedings - ALPIT 2008, 7th International Conference on Advanced Language Processing and Web Information Technology
Pages: 544-549
Number of pages: 6
Publication status: Published - 2008
Event: ALPIT 2008, 7th International Conference on Advanced Language Processing and Web Information Technology - Liaoning, China
Duration: 23 Jul 2008 to 25 Jul 2008

Publication series

Name: Proceedings - ALPIT 2008, 7th International Conference on Advanced Language Processing and Web Information Technology

Conference

Conference: ALPIT 2008, 7th International Conference on Advanced Language Processing and Web Information Technology
Country/Territory: China
Location: Liaoning
Period: 23/07/08 to 25/07/08
