Content extraction from Chinese web pages based on punctuations distribution

Qian Peng; Qinglin Wang; Yuan Li; Jixian Zhang; Yuexing Hao

doi:10.1109/CSSS.2012.341

Qian Peng^*, Qinglin Wang, Yuan Li, Jixian Zhang, Yuexing Hao

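^*Corresponding author for this work

School of Automation

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

2 Citations (Scopus)

Abstract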

Content extraction from web pages is a significant technology for obtaining information resources from the Internet. This paper proposes an effective and universal approach to extracting content from an HTML page by taking advantage of the distribution of Chinese punctuation. First, by computing the distribution of Chinese punctuation marks in the HTML source, a position that lies inside the web page content is found. Then, starting from that position, the left and right boundaries of the content are computed. Finally, the content within the left and right boundaries is extracted. Experimental results show that the accuracy of the algorithm exceeds 98%.
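The three steps in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not give the punctuation set, the unit of position, or the boundary rule, so the sketch assumes a small set of full-width marks (`CN_PUNCT`), works line by line, and uses two hypothetical parameters, `window` (neighbourhood used to pick the seed line) and `max_gap` (how many consecutive punctuation-free lines are tolerated while expanding).

```python
import re

# Assumed set of full-width Chinese punctuation marks (hypothetical choice;
# the abstract does not list the marks the paper actually uses).
CN_PUNCT = "，。；：！？、"
TAG_RE = re.compile(r"<[^>]+>")


def punct_counts(lines):
    """Step 1a: count the assumed punctuation marks on each source line."""
    return [sum(ch in CN_PUNCT for ch in line) for line in lines]


def find_seed(counts, window=1):
    """Step 1b: pick the punctuated line whose neighbourhood has most marks."""
    best, best_score = -1, -1
    for i, c in enumerate(counts):
        if c == 0:
            continue  # the seed itself must contain punctuation
        score = sum(counts[max(0, i - window): i + window + 1])
        if score > best_score:
            best, best_score = i, score
    return best  # -1 when no line contains any punctuation


def expand(counts, seed, max_gap=2):
    """Step 2: grow left and right from the seed, tolerating up to
    max_gap consecutive lines without punctuation."""
    bounds = []
    for step in (-1, 1):
        edge, gap, i = seed, 0, seed + step
        while 0 <= i < len(counts) and gap <= max_gap:
            if counts[i] > 0:
                edge, gap = i, 0
            else:
                gap += 1
            i += step
        bounds.append(edge)
    return bounds[0], bounds[1]


def extract_content(html, window=1, max_gap=2):
    """Step 3: return the tag-stripped text between the two boundaries."""
    lines = html.splitlines()
    counts = punct_counts(lines)
    seed = find_seed(counts, window)
    if seed < 0:
        return ""
    left, right = expand(counts, seed, max_gap)
    texts = (TAG_RE.sub("", line).strip() for line in lines[left:right + 1])
    return "\n".join(t for t in texts if t)
```

Calling `extract_content(html_source)` on a page whose navigation and footer contain no full-width punctuation should return only the body paragraphs. Working on lines keeps the sketch short; a character-offset variant would be closer to a "position" in the raw source, but the abstract does not say which the paper uses.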

Original language	English
Title of host publication	Proceedings - 2012 International Conference on Computer Science and Service System, CSSS 2012
Pages	1351-1355
Number of pages	5
DOI	https://doi.org/10.1109/CSSS.2012.341
Publication status	Published - 2012
Event	2012 International Conference on Computer Science and Service System, CSSS 2012 - Nanjing, China. Duration: 11 Aug 2012 → 13 Aug 2012

Publication series

Name	Proceedings - 2012 International Conference on Computer Science and Service System, CSSS 2012

Conference

Conference	2012 International Conference on Computer Science and Service System, CSSS 2012
Country/Territory	China
City	Nanjing
Period	11/08/12 → 13/08/12

Access to document

10.1109/CSSS.2012.341

Other files and links

Link to publication in Scopus

Cite this

Peng, Q., Wang, Q., Li, Y., Zhang, J., & Hao, Y. (2012). Content extraction from Chinese web pages based on punctuations distribution. In Proceedings - 2012 International Conference on Computer Science and Service System, CSSS 2012 (pp. 1351-1355). Article 6394579 (Proceedings - 2012 International Conference on Computer Science and Service System, CSSS 2012). https://doi.org/10.1109/CSSS.2012.341

@inproceedings{71f448b8124b41edb033d2e51d5d90d5,

title = "Content extraction from Chinese web pages based on punctuations distribution",

abstract = "Content extraction from web pages is a significant technology for obtaining information resources from the Internet. This paper proposes an effective and universal approach to extracting content from an HTML page by taking advantage of the distribution of Chinese punctuation. First, by computing the distribution of Chinese punctuation marks in the HTML source, a position that lies inside the web page content is found. Then, starting from that position, the left and right boundaries of the content are computed. Finally, the content within the left and right boundaries is extracted. Experimental results show that the accuracy of the algorithm exceeds 98\%.",

keywords = "content extraction, kernel punctuation, punctuation distribution",

author = "Qian Peng and Qinglin Wang and Yuan Li and Jixian Zhang and Yuexing Hao",

year = "2012",

doi = "10.1109/CSSS.2012.341",

language = "English",

isbn = "9780769547190",

series = "Proceedings - 2012 International Conference on Computer Science and Service System, CSSS 2012",

pages = "1351--1355",

booktitle = "Proceedings - 2012 International Conference on Computer Science and Service System, CSSS 2012",

note = "2012 International Conference on Computer Science and Service System, CSSS 2012 ; Conference date: 11-08-2012 Through 13-08-2012",

}

Peng, Q, Wang, Q, Li, Y, Zhang, J & Hao, Y 2012, Content extraction from Chinese web pages based on punctuations distribution. in Proceedings - 2012 International Conference on Computer Science and Service System, CSSS 2012., 6394579, Proceedings - 2012 International Conference on Computer Science and Service System, CSSS 2012, pp. 1351-1355, 2012 International Conference on Computer Science and Service System, CSSS 2012, Nanjing, China, 11/08/12. https://doi.org/10.1109/CSSS.2012.341

Content extraction from Chinese web pages based on punctuations distribution. / Peng, Qian; Wang, Qinglin; Li, Yuan et al.
Proceedings - 2012 International Conference on Computer Science and Service System, CSSS 2012. 2012. p. 1351-1355 6394579 (Proceedings - 2012 International Conference on Computer Science and Service System, CSSS 2012).

TY - GEN

T1 - Content extraction from Chinese web pages based on punctuations distribution

AU - Peng, Qian

AU - Wang, Qinglin

AU - Li, Yuan

AU - Zhang, Jixian

AU - Hao, Yuexing

PY - 2012

Y1 - 2012

N2 - Content extraction from web pages is a significant technology for obtaining information resources from the Internet. This paper proposes an effective and universal approach to extracting content from an HTML page by taking advantage of the distribution of Chinese punctuation. First, by computing the distribution of Chinese punctuation marks in the HTML source, a position that lies inside the web page content is found. Then, starting from that position, the left and right boundaries of the content are computed. Finally, the content within the left and right boundaries is extracted. Experimental results show that the accuracy of the algorithm exceeds 98%.

AB - Content extraction from web pages is a significant technology for obtaining information resources from the Internet. This paper proposes an effective and universal approach to extracting content from an HTML page by taking advantage of the distribution of Chinese punctuation. First, by computing the distribution of Chinese punctuation marks in the HTML source, a position that lies inside the web page content is found. Then, starting from that position, the left and right boundaries of the content are computed. Finally, the content within the left and right boundaries is extracted. Experimental results show that the accuracy of the algorithm exceeds 98%.

KW - content extraction

KW - kernel punctuation

KW - punctuation distribution

UR - http://www.scopus.com/inward/record.url?scp=84873851577&partnerID=8YFLogxK

U2 - 10.1109/CSSS.2012.341

DO - 10.1109/CSSS.2012.341

M3 - Conference contribution

AN - SCOPUS:84873851577

SN - 9780769547190

T3 - Proceedings - 2012 International Conference on Computer Science and Service System, CSSS 2012

SP - 1351

EP - 1355

BT - Proceedings - 2012 International Conference on Computer Science and Service System, CSSS 2012

T2 - 2012 International Conference on Computer Science and Service System, CSSS 2012

Y2 - 11 August 2012 through 13 August 2012

ER -

Peng Q, Wang Q, Li Y, Zhang J, Hao Y. Content extraction from Chinese web pages based on punctuations distribution. In Proceedings - 2012 International Conference on Computer Science and Service System, CSSS 2012. 2012. p. 1351-1355. 6394579. (Proceedings - 2012 International Conference on Computer Science and Service System, CSSS 2012). doi: 10.1109/CSSS.2012.341
