TY - JOUR
T1 - Combining a segmentation-like approach and a density-based approach in content extraction
AU - Lin, Shuang
AU - Chen, Jie
AU - Niu, Zhendong
PY - 2012
Y1 - 2012
N2 - Density-based approaches in content extraction, whose task is to extract contents from Web pages, are commonly used to obtain page contents that are critical to many Web mining applications. However, traditional density-based approaches cannot effectively manage pages that contain short contents and long noises. To overcome this problem, in this paper, we propose a content extraction approach for obtaining content from news pages that combines a segmentation-like approach and a density-based approach. A tool called BlockExtractor was developed based on this approach. BlockExtractor identifies contents in three steps. First, it looks for all Block-Level Elements (BLE) & Inline Elements (IE) blocks, which are designed to roughly segment pages into blocks. Second, it computes the densities of each BLE&IE block and its element to eliminate noises. Third, it removes all redundant BLE&IE blocks that have emerged in other pages from the same site. Compared with three other density-based approaches, our approach shows significant advantages in both precision and recall.
AB - Density-based approaches in content extraction, whose task is to extract contents from Web pages, are commonly used to obtain page contents that are critical to many Web mining applications. However, traditional density-based approaches cannot effectively manage pages that contain short contents and long noises. To overcome this problem, in this paper, we propose a content extraction approach for obtaining content from news pages that combines a segmentation-like approach and a density-based approach. A tool called BlockExtractor was developed based on this approach. BlockExtractor identifies contents in three steps. First, it looks for all Block-Level Elements (BLE) & Inline Elements (IE) blocks, which are designed to roughly segment pages into blocks. Second, it computes the densities of each BLE&IE block and its element to eliminate noises. Third, it removes all redundant BLE&IE blocks that have emerged in other pages from the same site. Compared with three other density-based approaches, our approach shows significant advantages in both precision and recall.
KW - content extraction
KW - density-based approach
KW - segmentation
UR - http://www.scopus.com/inward/record.url?scp=84864536493&partnerID=8YFLogxK
U2 - 10.1109/TST.2012.6216755
DO - 10.1109/TST.2012.6216755
M3 - Article
AN - SCOPUS:84864536493
SN - 1007-0214
VL - 17
SP - 256
EP - 264
JO - Tsinghua Science and Technology
JF - Tsinghua Science and Technology
IS - 3
M1 - 6216755
ER -