Combining a segmentation-like approach and a density-based approach in content extraction

Shuang Lin; Jie Chen; Zhendong Niu

doi:10.1109/TST.2012.6216755


Shuang Lin, Jie Chen, Zhendong Niu^*

^*Corresponding author for this work

School of Computer Science

Beijing Institute of Technology

Research output: Contribution to journal › Article › Peer-reviewed

5 Citations (Scopus)

Abstract

Density-based approaches to content extraction, the task of extracting content from Web pages, are commonly used to obtain page content that is critical to many Web mining applications. However, traditional density-based approaches cannot effectively handle pages that contain short content and long noise. To overcome this problem, we propose a content extraction approach for news pages that combines a segmentation-like approach with a density-based approach. A tool called BlockExtractor was developed based on this approach. BlockExtractor identifies content in three steps. First, it looks for all Block-Level Elements (BLE) & Inline Elements (IE) blocks, which are designed to roughly segment pages into blocks. Second, it computes the densities of each BLE&IE block and its elements to eliminate noise. Third, it removes all redundant BLE&IE blocks that have appeared in other pages from the same site. Compared with three other density-based approaches, our approach shows significant advantages in both precision and recall.
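The three steps described in the abstract can be illustrated with a minimal Python sketch. It is not BlockExtractor itself: the block-level tag list, the density formula (non-link text length divided by inline tag count plus one), the `min_density` threshold, and all function names are assumptions made for illustration, since the abstract does not specify them. Handling of `script` and `style` content is also left out.

```python
from html.parser import HTMLParser

# Assumed subset of block-level tags; the paper's actual BLE list is not given here.
BLOCK_TAGS = {"div", "p", "td", "li", "ul", "table", "h1", "h2", "h3", "section", "article"}


class BlockSegmenter(HTMLParser):
    """Step 1 (sketch): cut the page into rough blocks at block-level tag boundaries."""

    def __init__(self):
        super().__init__()
        self.blocks = []       # finished blocks: (text, inline_tag_count, link_chars)
        self._text = []
        self._tags = 0
        self._link_chars = 0
        self._in_link = 0

    def flush(self):
        # Normalize whitespace, store the block if it has text, then reset counters.
        text = " ".join(" ".join(self._text).split())
        if text:
            self.blocks.append((text, self._tags, self._link_chars))
        self._text, self._tags, self._link_chars = [], 0, 0

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.flush()
        else:
            self._tags += 1    # inline elements count against the block's density
        if tag == "a":
            self._in_link += 1

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.flush()
        if tag == "a" and self._in_link:
            self._in_link -= 1

    def handle_data(self, data):
        self._text.append(data)
        if self._in_link:
            self._link_chars += len(" ".join(data.split()))


def segment(html):
    parser = BlockSegmenter()
    parser.feed(html)
    parser.close()
    parser.flush()             # keep any trailing text outside a block-level tag
    return parser.blocks


def density(block):
    """Step 2 (sketch): an assumed density, non-link characters per inline tag."""
    text, tags, link_chars = block
    return (len(text) - link_chars) / (tags + 1)


def extract(html, other_pages=(), min_density=20.0):
    """Step 3 (sketch): drop blocks whose text also appears on other pages of the site."""
    seen_elsewhere = {text for page in other_pages for text, _, _ in segment(page)}
    kept = [
        block[0]
        for block in segment(html)
        if density(block) >= min_density and block[0] not in seen_elsewhere
    ]
    return "\n".join(kept)
```

Calling `extract(page_html, other_pages=[sibling_html])` returns the surviving block texts joined by newlines. Exact-match comparison of block text is the simplest possible redundancy check; a real system would likely need something more tolerant of small per-page differences.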

Original language	English
Article number	6216755
Pages (from-to)	256-264
Number of pages	9
Journal	Tsinghua Science and Technology
Volume	17
Issue number	3
DOI	https://doi.org/10.1109/TST.2012.6216755
Publication status	Published - 2012


Cite this

@article{619a10b700d14974a28aef3921381f23,

title = "Combining a segmentation-like approach and a density-based approach in content extraction",

abstract = "Density-based approaches in content extraction, whose task is to extract contents from Web pages, are commonly used to obtain page contents that are critical to many Web mining applications. However, traditional density-based approaches cannot effectively manage pages that contain short contents and long noises. To overcome this problem, in this paper, we propose a content extraction approach for obtaining content from news pages that combines a segmentation-like approach and a density-based approach. A tool called BlockExtractor was developed based on this approach. BlockExtractor identifies contents in three steps. First, it looks for all Block-Level Elements (BLE) & Inline Elements (IE) blocks, which are designed to roughly segment pages into blocks. Second, it computes the densities of each BLE&IE block and its element to eliminate noises. Third, it removes all redundant BLE&IE blocks that have emerged in other pages from the same site. Compared with three other density-based approaches, our approach shows significant advantages in both precision and recall.",

keywords = "content extraction, density-based approach, segmentation",

author = "Shuang Lin and Jie Chen and Zhendong Niu",

year = "2012",

doi = "10.1109/TST.2012.6216755",

language = "English",

volume = "17",

pages = "256--264",

journal = "Tsinghua Science and Technology",

issn = "1007-0214",

publisher = "Tsinghua University",

number = "3",

}

TY - JOUR

T1 - Combining a segmentation-like approach and a density-based approach in content extraction

AU - Lin, Shuang

AU - Chen, Jie

AU - Niu, Zhendong

PY - 2012

Y1 - 2012

N2 - Density-based approaches in content extraction, whose task is to extract contents from Web pages, are commonly used to obtain page contents that are critical to many Web mining applications. However, traditional density-based approaches cannot effectively manage pages that contain short contents and long noises. To overcome this problem, in this paper, we propose a content extraction approach for obtaining content from news pages that combines a segmentation-like approach and a density-based approach. A tool called BlockExtractor was developed based on this approach. BlockExtractor identifies contents in three steps. First, it looks for all Block-Level Elements (BLE) & Inline Elements (IE) blocks, which are designed to roughly segment pages into blocks. Second, it computes the densities of each BLE&IE block and its element to eliminate noises. Third, it removes all redundant BLE&IE blocks that have emerged in other pages from the same site. Compared with three other density-based approaches, our approach shows significant advantages in both precision and recall.

AB - Density-based approaches in content extraction, whose task is to extract contents from Web pages, are commonly used to obtain page contents that are critical to many Web mining applications. However, traditional density-based approaches cannot effectively manage pages that contain short contents and long noises. To overcome this problem, in this paper, we propose a content extraction approach for obtaining content from news pages that combines a segmentation-like approach and a density-based approach. A tool called BlockExtractor was developed based on this approach. BlockExtractor identifies contents in three steps. First, it looks for all Block-Level Elements (BLE) & Inline Elements (IE) blocks, which are designed to roughly segment pages into blocks. Second, it computes the densities of each BLE&IE block and its element to eliminate noises. Third, it removes all redundant BLE&IE blocks that have emerged in other pages from the same site. Compared with three other density-based approaches, our approach shows significant advantages in both precision and recall.

KW - content extraction

KW - density-based approach

KW - segmentation

UR - http://www.scopus.com/inward/record.url?scp=84864536493&partnerID=8YFLogxK

U2 - 10.1109/TST.2012.6216755

DO - 10.1109/TST.2012.6216755

M3 - Article

AN - SCOPUS:84864536493

SN - 1007-0214

VL - 17

SP - 256

EP - 264

JO - Tsinghua Science and Technology

JF - Tsinghua Science and Technology

IS - 3

M1 - 6216755

ER -
