Content extraction from Chinese web pages based on punctuations distribution

Qian Peng*, Qinglin Wang, Yuan Li, Jixian Zhang, Yuexing Hao

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

2 Citations (Scopus)

Abstract

Content extraction from web pages is an important technique for obtaining information resources from the Internet. This paper proposes an effective and universal approach to extracting content from an HTML page by taking advantage of the distribution of Chinese punctuation. First, by computing the distribution of Chinese punctuation marks in the HTML source, a position inside the web page content is found. Then, starting from that position, the left and right boundaries of the content are computed. Finally, the content between the left and right boundaries is extracted. Experimental results show that the accuracy of the algorithm exceeds 98%.
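The abstract does not spell out how the seed position or the boundaries are computed, so the following Python sketch is only one plausible reading of the three steps, not the authors' implementation. The function names, the punctuation set `CN_PUNCT`, the density window `window`, and the stopping rule `max_gap` are all assumptions introduced for illustration.

```python
import re
from bisect import bisect_left, bisect_right

# Assumed set of full-width Chinese punctuation marks (not taken from the paper).
CN_PUNCT = ",。!?;:、"


def punct_positions(html):
    """Offsets of every Chinese punctuation mark in the raw HTML source."""
    return [i for i, ch in enumerate(html) if ch in CN_PUNCT]


def find_seed(positions, window=200):
    """Index of the mark whose +/- `window` neighbourhood holds the most marks.

    Assumption: the densest punctuation region lies inside the main content.
    """
    best_idx, best_count = 0, -1
    for idx, p in enumerate(positions):
        lo = bisect_left(positions, p - window)
        hi = bisect_right(positions, p + window)
        if hi - lo > best_count:
            best_idx, best_count = idx, hi - lo
    return best_idx


def expand(positions, seed_idx, max_gap=300):
    """Grow left and right from the seed while neighbouring marks stay close.

    Assumption: a gap larger than `max_gap` characters marks a boundary.
    """
    left = seed_idx
    while left > 0 and positions[left] - positions[left - 1] <= max_gap:
        left -= 1
    right = seed_idx
    while (right < len(positions) - 1
           and positions[right + 1] - positions[right] <= max_gap):
        right += 1
    return positions[left], positions[right]


def extract_content(html, window=200, max_gap=300):
    """Return the tag-free text between the estimated content boundaries."""
    positions = punct_positions(html)
    if not positions:
        return ""
    seed_idx = find_seed(positions, window)
    left, right = expand(positions, seed_idx, max_gap)
    # Pull the left boundary back to the start of the enclosing text run.
    start = html.rfind(">", 0, left) + 1
    chunk = html[start:right + 1]
    return re.sub(r"<[^>]+>", "", chunk).strip()
```

Calling `extract_content(page_source)` on a page whose navigation and footer carry little or no Chinese punctuation would return only the punctuation-dense body text. Both thresholds are arbitrary defaults and would need tuning on real pages.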

Original language: English
Title of host publication: Proceedings - 2012 International Conference on Computer Science and Service System, CSSS 2012
Pages: 1351-1355
Number of pages: 5
Publication status: Published - 2012
Event: 2012 International Conference on Computer Science and Service System, CSSS 2012 - Nanjing, China
Duration: 11 Aug 2012 to 13 Aug 2012

Publication series

Name: Proceedings - 2012 International Conference on Computer Science and Service System, CSSS 2012

Conference

Conference: 2012 International Conference on Computer Science and Service System, CSSS 2012
Country/Territory: China
City: Nanjing
Period: 11/08/12 to 13/08/12

Keywords

  • content extraction
  • kernel punctuation
  • punctuation distribution
