Content extraction from Chinese web pages based on punctuations distribution

Qian Peng; Qinglin Wang; Yuan Li; Jixian Zhang; Yuexing Hao

doi:10.1109/CSSS.2012.341

Qian Peng^*, Qinglin Wang, Yuan Li, Jixian Zhang, Yuexing Hao

^*Corresponding author for this work

School of Automation

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

2 Citations (Scopus)

Abstract

Content extraction from web pages is a significant technology for obtaining information resources from the Internet. This paper proposes an effective and universal approach to extracting content from an HTML page by taking advantage of the distribution of Chinese punctuation. First, by computing the distribution of Chinese punctuation marks in the HTML source, a position inside the web page content is found. Then, starting from that position, the left and right boundaries of the content are computed. Finally, the content within those boundaries is extracted. Experimental results show that the accuracy of the algorithm exceeds 98%.
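The three steps in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the punctuation set, the density measure, or the boundary rule, so `CN_PUNCT`, `extract_content`, and the `max_gap` threshold below are all hypothetical choices made for illustration.

```python
import re

# Assumed punctuation set; the paper's exact set is not stated in the abstract.
CN_PUNCT = "，。！？；：、"


def extract_content(html: str, max_gap: int = 200) -> str:
    """Return the text around the densest cluster of Chinese punctuation.

    max_gap is a hypothetical threshold: two punctuation marks further
    apart than this many characters are treated as separate regions.
    """
    positions = [i for i, ch in enumerate(html) if ch in CN_PUNCT]
    if not positions:
        return ""

    # Step 1: pick a seed inside the content, here the punctuation mark
    # with the most other marks within max_gap characters of it.
    def neighbours(k: int) -> int:
        return sum(1 for p in positions if abs(p - positions[k]) <= max_gap)

    seed = max(range(len(positions)), key=neighbours)

    # Step 2: grow the left and right boundaries while adjacent
    # punctuation marks stay within max_gap of each other.
    left = seed
    while left > 0 and positions[left] - positions[left - 1] <= max_gap:
        left -= 1
    right = seed
    while (right < len(positions) - 1
           and positions[right + 1] - positions[right] <= max_gap):
        right += 1

    # Step 3: widen to the enclosing tag edges and strip the markup.
    start = html.rfind(">", 0, positions[left]) + 1
    end = html.find("<", positions[right])
    if end == -1:
        end = len(html)
    return re.sub(r"<[^>]+>", "", html[start:end]).strip()
```

Calling `extract_content(page_source)` on a page whose navigation and footer contain no Chinese punctuation returns only the text of the punctuated paragraphs. The quadratic seed search keeps the sketch short; a sliding-window count would be the obvious replacement for large pages.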

Original language	English
Title of host publication	Proceedings - 2012 International Conference on Computer Science and Service System, CSSS 2012
Pages	1351-1355
Number of pages	5
DOIs	https://doi.org/10.1109/CSSS.2012.341
Publication status	Published - 2012
Event	2012 International Conference on Computer Science and Service System, CSSS 2012 - Nanjing, China
Duration	11 Aug 2012 → 13 Aug 2012

Publication series

Name	Proceedings - 2012 International Conference on Computer Science and Service System, CSSS 2012

Conference

Conference	2012 International Conference on Computer Science and Service System, CSSS 2012
Country/Territory	China
City	Nanjing
Period	11/08/12 → 13/08/12

Keywords

content extraction
kernel punctuation
punctuation distribution

Access to Document

10.1109/CSSS.2012.341

Cite this

Peng, Q., Wang, Q., Li, Y., Zhang, J., & Hao, Y. (2012). Content extraction from Chinese web pages based on punctuations distribution. In Proceedings - 2012 International Conference on Computer Science and Service System, CSSS 2012 (pp. 1351-1355). Article 6394579 (Proceedings - 2012 International Conference on Computer Science and Service System, CSSS 2012). https://doi.org/10.1109/CSSS.2012.341

@inproceedings{71f448b8124b41edb033d2e51d5d90d5,
  title = "Content extraction from {Chinese} web pages based on punctuations distribution",
  abstract = "Content extraction from web pages is a significant technology for obtaining information resources from the Internet. This paper proposes an effective and universal approach to extracting content from an HTML page by taking advantage of the distribution of Chinese punctuation. First, by computing the distribution of Chinese punctuation marks in the HTML source, a position inside the web page content is found. Then, starting from that position, the left and right boundaries of the content are computed. Finally, the content within those boundaries is extracted. Experimental results show that the accuracy of the algorithm exceeds 98%.",
  keywords = "content extraction, kernel punctuation, punctuation distribution",
  author = "Qian Peng and Qinglin Wang and Yuan Li and Jixian Zhang and Yuexing Hao",
  year = "2012",
  doi = "10.1109/CSSS.2012.341",
  language = "English",
  isbn = "9780769547190",
  series = "Proceedings - 2012 International Conference on Computer Science and Service System, CSSS 2012",
  pages = "1351--1355",
  booktitle = "Proceedings - 2012 International Conference on Computer Science and Service System, CSSS 2012",
  note = "2012 International Conference on Computer Science and Service System, CSSS 2012 ; Conference date: 11-08-2012 Through 13-08-2012",
}

Peng, Q, Wang, Q, Li, Y, Zhang, J & Hao, Y 2012, Content extraction from Chinese web pages based on punctuations distribution. in Proceedings - 2012 International Conference on Computer Science and Service System, CSSS 2012., 6394579, Proceedings - 2012 International Conference on Computer Science and Service System, CSSS 2012, pp. 1351-1355, 2012 International Conference on Computer Science and Service System, CSSS 2012, Nanjing, China, 11/08/12. https://doi.org/10.1109/CSSS.2012.341

Content extraction from Chinese web pages based on punctuations distribution. / Peng, Qian; Wang, Qinglin; Li, Yuan et al.
Proceedings - 2012 International Conference on Computer Science and Service System, CSSS 2012. 2012. p. 1351-1355 6394579 (Proceedings - 2012 International Conference on Computer Science and Service System, CSSS 2012).

TY - GEN

T1 - Content extraction from Chinese web pages based on punctuations distribution

AU - Peng, Qian

AU - Wang, Qinglin

AU - Li, Yuan

AU - Zhang, Jixian

AU - Hao, Yuexing

PY - 2012

Y1 - 2012

AB - Content extraction from web pages is a significant technology for obtaining information resources from the Internet. This paper proposes an effective and universal approach to extracting content from an HTML page by taking advantage of the distribution of Chinese punctuation. First, by computing the distribution of Chinese punctuation marks in the HTML source, a position inside the web page content is found. Then, starting from that position, the left and right boundaries of the content are computed. Finally, the content within those boundaries is extracted. Experimental results show that the accuracy of the algorithm exceeds 98%.

KW - content extraction

KW - kernel punctuation

KW - punctuation distribution

UR - http://www.scopus.com/inward/record.url?scp=84873851577&partnerID=8YFLogxK

U2 - 10.1109/CSSS.2012.341

DO - 10.1109/CSSS.2012.341

M3 - Conference contribution

AN - SCOPUS:84873851577

SN - 9780769547190

T3 - Proceedings - 2012 International Conference on Computer Science and Service System, CSSS 2012

SP - 1351

EP - 1355

BT - Proceedings - 2012 International Conference on Computer Science and Service System, CSSS 2012

T2 - 2012 International Conference on Computer Science and Service System, CSSS 2012

Y2 - 11 August 2012 through 13 August 2012

ER -
