
DOM based content extraction via text density

  • Fei Sun*
  • Dandan Song
  • Lejian Liao
  • *Corresponding author for this work
  • Beijing Institute of Technology

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

In addition to the main content, most web pages also contain navigation panels, advertisements, and copyright and disclaimer notices. This additional content, also known as noise, is typically unrelated to the main subject and may hamper the performance of web data mining, so it needs to be removed properly. In this paper, we present Content Extraction via Text Density (CETD), a fast, accurate and general method for extracting content from diverse web pages that uses DOM (Document Object Model) node text density to preserve the original structure. For this purpose, we introduce two concepts to measure the importance of nodes: Text Density and Composite Text Density. In order to extract content intact, we propose a technique called DensitySum to replace Data Smoothing. The approach was evaluated with the CleanEval benchmark and with randomly selected pages from well-known websites, covering various web domains and styles. The average F1-scores with our method were 8.79% higher than the best scores among several alternative methods.
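The abstract names Text Density without defining it. As a rough illustration only, the sketch below assumes a node's text density is the number of characters in its subtree divided by the number of tags in that subtree (treated as 1 when there are none). The `Node`, `TreeBuilder`, `subtree_stats`, and `text_density` names are hypothetical helpers written for this example, not code from the paper, and Composite Text Density and DensitySum are not implemented here.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so the parser must not descend into them.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class Node:
    """Minimal DOM-like node (hypothetical helper for this sketch)."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.own_chars = 0  # characters of text directly inside this node


class TreeBuilder(HTMLParser):
    """Builds a small tree of Node objects from well-formed HTML."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this tag name, if any.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.own_chars += len(data.strip())


def subtree_stats(node):
    """Return (characters, tags) found below the given node."""
    chars, tags = node.own_chars, 0
    for child in node.children:
        child_chars, child_tags = subtree_stats(child)
        chars += child_chars
        tags += child_tags + 1  # the child element itself counts as one tag
    return chars, tags


def text_density(node):
    """Assumed definition: characters per tag, with a zero tag count treated as 1."""
    chars, tags = subtree_stats(node)
    return chars / max(tags, 1)


html = (
    '<body>'
    '<div id="nav"><a href="#">Home</a><a href="#">News</a><a href="#">About</a></div>'
    '<div id="main"><p>Text density compares how much text a node holds '
    'with how many tags it uses.</p></div>'
    '</body>'
)

builder = TreeBuilder()
builder.feed(html)
body = builder.root.children[0]
nav, main = body.children

for name, node in (("nav", nav), ("main", main)):
    print(name, round(text_density(node), 2))
```

Under this assumed definition, the link-heavy navigation block scores far lower than the paragraph block, which is the intuition behind using density to separate noise from main content.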

Original language: English
Title of host publication: SIGIR'11 - Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval
Publisher: Association for Computing Machinery
Pages: 245-254
Number of pages: 10
ISBN (Print): 9781450309349
DOI
Publication status: Published - 2011
Event: 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2011 - Beijing, China
Duration: 24 Jul 2011 to 28 Jul 2011

Publication series

Name: SIGIR'11 - Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval

Conference

Conference: 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2011
Country/Territory: China
City: Beijing
Period: 24/07/11 to 28/07/11
