TY - GEN
T1 - A dynamic learning framework to thoroughly extract structured data from web pages without human efforts
AU - Song, Dandan
AU - Wu, Yunpeng
AU - Liao, Lejian
AU - Li, Long
AU - Sun, Fei
PY - 2012
Y1 - 2012
N2 - Structured data in web pages contains a wealth of concrete and comprehensive information. The attributes of entities and their corresponding values are valuable resources for automatic semantic annotation, knowledge discovery, and information utilization. However, the varied display styles and formats of web pages make extracting them a challenging task. Based on our observation, although a single page offers little information, different web pages and different web sites describing similar entities can provide adequate knowledge for computers to learn from. This paper presents a dynamic learning framework to effectively extract structured information from a large number of websites in various verticals (e.g., books, cameras, jobs). Unlike existing approaches, which are static, require manually labeled samples, and cannot adapt to unseen attributes, our approach aims to extract structured data from web pages dynamically, automatically, and thoroughly. Experiments on a total of 17,850 web pages in 4 verticals demonstrated the effectiveness of our framework.
AB - Structured data in web pages contains a wealth of concrete and comprehensive information. The attributes of entities and their corresponding values are valuable resources for automatic semantic annotation, knowledge discovery, and information utilization. However, the varied display styles and formats of web pages make extracting them a challenging task. Based on our observation, although a single page offers little information, different web pages and different web sites describing similar entities can provide adequate knowledge for computers to learn from. This paper presents a dynamic learning framework to effectively extract structured information from a large number of websites in various verticals (e.g., books, cameras, jobs). Unlike existing approaches, which are static, require manually labeled samples, and cannot adapt to unseen attributes, our approach aims to extract structured data from web pages dynamically, automatically, and thoroughly. Experiments on a total of 17,850 web pages in 4 verticals demonstrated the effectiveness of our framework.
KW - Information extraction
KW - Learning framework
KW - Structured data
UR - http://www.scopus.com/inward/record.url?scp=84866610336&partnerID=8YFLogxK
U2 - 10.1145/2350190.2350199
DO - 10.1145/2350190.2350199
M3 - Conference contribution
AN - SCOPUS:84866610336
SN - 9781450315463
T3 - Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
BT - Proceedings of the ACM SIGKDD Workshop on Mining Data Semantics 2012, MDS'12 - SIGKDD 2012
T2 - ACM SIGKDD Workshop on Mining Data Semantics 2012, MDS'12 - SIGKDD 2012
Y2 - 12 August 2012 through 16 August 2012
ER -