Content extraction from chinese web pages based on punctuations distribution

Qian Peng*, Qinglin Wang, Yuan Li, Jixian Zhang, Yuexing Hao

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-reviewed

2 Citations (Scopus)

Abstract

Content extraction from web pages is an important technique for obtaining information resources from the Internet. This paper proposes an effective and general approach to extracting content from an HTML page by taking advantage of the distribution of Chinese punctuation. First, the distribution of Chinese punctuation marks in the HTML source is computed to find a position that lies inside the page content. Then, starting from that position, the left and right boundaries of the content are computed. Finally, the content within those boundaries is extracted. Experimental results show that the accuracy of the algorithm exceeds 98%.
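
The three steps in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the paper's actual formulas, so everything concrete here is an assumption made for illustration: the punctuation set `CN_PUNCT`, the sliding-window density measure, the `window` and `max_gap` thresholds, and the function names. It should be read as one plausible reading of the idea, not as the authors' algorithm.

```python
import re

# Assumed punctuation set; the paper's exact set is not stated in the abstract.
CN_PUNCT = "，。、；：？！“”（）《》"


def punctuation_positions(html):
    """Return the character offsets of every Chinese punctuation mark."""
    return [i for i, ch in enumerate(html) if ch in CN_PUNCT]


def densest_index(positions, window=200):
    """Step 1 (assumed measure): pick the mark followed by the most marks
    within `window` characters; it is taken to lie inside the content."""
    best_idx, best_count = 0, 0
    right = 0
    for left in range(len(positions)):
        while right < len(positions) and positions[right] - positions[left] <= window:
            right += 1
        if right - left > best_count:
            best_idx, best_count = left, right - left
    return best_idx


def extract_content(html, window=200, max_gap=300):
    """Steps 2 and 3: grow left and right boundaries from the seed, then
    return the tag-free text between them."""
    positions = punctuation_positions(html)
    if not positions:
        return ""
    seed = densest_index(positions, window)

    # Move each boundary outward while neighbouring marks stay within max_gap.
    lo = seed
    while lo > 0 and positions[lo] - positions[lo - 1] <= max_gap:
        lo -= 1
    hi = seed
    while hi < len(positions) - 1 and positions[hi + 1] - positions[hi] <= max_gap:
        hi += 1

    # Widen the left edge to the preceding '>' so the first sentence is whole.
    start = html.rfind(">", 0, positions[lo]) + 1
    end = positions[hi] + 1

    text = re.sub(r"<[^>]+>", "", html[start:end])  # drop tags inside the span
    return re.sub(r"\s+", " ", text).strip()
```

Calling `extract_content(page_source)` on a page whose navigation and footer consist of short link texts without punctuation would return only the punctuated body text. In this sketch the last sentence must end with a punctuation mark to be captured, and the thresholds would need tuning on real pages.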

Original language: English
Title of host publication: Proceedings - 2012 International Conference on Computer Science and Service System, CSSS 2012
Pages: 1351-1355
Number of pages: 5
Publication status: Published - 2012
Event: 2012 International Conference on Computer Science and Service System, CSSS 2012 - Nanjing, China
Duration: 11 Aug 2012 to 13 Aug 2012

Publication series

Name: Proceedings - 2012 International Conference on Computer Science and Service System, CSSS 2012

Conference

Conference: 2012 International Conference on Computer Science and Service System, CSSS 2012
Country/Territory: China
City: Nanjing
Period: 11/08/12 to 13/08/12
