TY - GEN
T1 - Content extraction from Chinese web pages based on punctuations distribution
AU - Peng, Qian
AU - Wang, Qinglin
AU - Li, Yuan
AU - Zhang, Jixian
AU - Hao, Yuexing
PY - 2012
Y1 - 2012
N2 - Content extraction from web pages is an important technology for obtaining information resources from the Internet. This paper proposes an effective and universal approach to extracting content from an HTML page by taking advantage of the distribution of Chinese punctuation marks. First, the distribution of Chinese punctuation marks in the HTML source is computed to find a position that lies inside the web page content. Then, starting from that position, the left and right boundaries of the content are computed. Finally, the content between the two boundaries is extracted. Experimental results show that the accuracy of the algorithm exceeds 98%.
AB - Content extraction from web pages is an important technology for obtaining information resources from the Internet. This paper proposes an effective and universal approach to extracting content from an HTML page by taking advantage of the distribution of Chinese punctuation marks. First, the distribution of Chinese punctuation marks in the HTML source is computed to find a position that lies inside the web page content. Then, starting from that position, the left and right boundaries of the content are computed. Finally, the content between the two boundaries is extracted. Experimental results show that the accuracy of the algorithm exceeds 98%.
KW - content extraction
KW - kernel punctuation
KW - punctuation distribution
UR - http://www.scopus.com/inward/record.url?scp=84873851577&partnerID=8YFLogxK
U2 - 10.1109/CSSS.2012.341
DO - 10.1109/CSSS.2012.341
M3 - Conference contribution
AN - SCOPUS:84873851577
SN - 9780769547190
T3 - Proceedings - 2012 International Conference on Computer Science and Service System, CSSS 2012
SP - 1351
EP - 1355
BT - Proceedings - 2012 International Conference on Computer Science and Service System, CSSS 2012
T2 - 2012 International Conference on Computer Science and Service System, CSSS 2012
Y2 - 11 August 2012 through 13 August 2012
ER -