TY - GEN
T1 - Vertical classification of web pages for structured data extraction
AU - Li, Long
AU - Song, Dandan
AU - Liao, Lejian
PY - 2012
Y1 - 2012
N2 - We propose a general hierarchical vertical classification framework that automatically discovers the inherent hierarchical structure of relationships among verticals from flat datasets and then builds a hierarchical classifier. We conducted a set of comparison experiments to evaluate its performance, comparing flat versus hierarchical structures of relationships as well as different feature selection and classification methods. Experimental results show that hierarchical classifiers built with the proposed framework substantially outperform flat classifiers when classifying unseen web pages. Among them, the Support Vector Machine using Odds Ratio to select discriminative features performs best.
AB - We propose a general hierarchical vertical classification framework that automatically discovers the inherent hierarchical structure of relationships among verticals from flat datasets and then builds a hierarchical classifier. We conducted a set of comparison experiments to evaluate its performance, comparing flat versus hierarchical structures of relationships as well as different feature selection and classification methods. Experimental results show that hierarchical classifiers built with the proposed framework substantially outperform flat classifiers when classifying unseen web pages. Among them, the Support Vector Machine using Odds Ratio to select discriminative features performs best.
KW - Automatic hierarchy
KW - Hierarchical classifiers
KW - Structured data extracting
KW - Vertical classification
UR - http://www.scopus.com/inward/record.url?scp=84871590562&partnerID=8YFLogxK
U2 - 10.1007/978-3-642-35341-3_44
DO - 10.1007/978-3-642-35341-3_44
M3 - Conference contribution
AN - SCOPUS:84871590562
SN - 9783642353406
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 486
EP - 495
BT - Information Retrieval Technology - 8th Asia Information Retrieval Societies Conference, AIRS 2012, Proceedings
T2 - 8th Asia Information Retrieval Societies Conference, AIRS 2012
Y2 - 17 December 2012 through 19 December 2012
ER -