A dynamic learning framework to thoroughly extract structured data from web pages without human efforts

Dandan Song, Yunpeng Wu, Lejian Liao*, Long Li, Fei Sun

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

4 Citations (Scopus)

Abstract

Structured data on web pages contains a great deal of concrete and comprehensive information. The attributes of entities and their corresponding values are valuable resources for automatic semantic annotation, knowledge discovery, and information utilization. However, the variety of display styles and formats across web pages makes extracting them a challenging task. We observe that, although a single page carries little information, different web pages and web sites describing similar entities together provide enough knowledge for a computer to learn from. This paper presents a dynamic learning framework that effectively extracts structured information from a very large number of websites in various verticals (e.g., books, cameras, jobs). Unlike existing approaches, which are static, require manually labeled samples, and cannot adapt to unseen attributes, our approach aims to extract structured data from web pages dynamically, automatically, and thoroughly. Experiments on a total of 17,850 web pages in 4 verticals demonstrate the effectiveness of our framework.

Original language: English
Title of host publication: Proceedings of the ACM SIGKDD Workshop on Mining Data Semantics 2012, MDS'12 - SIGKDD 2012
Publication status: Published - 2012
Event: ACM SIGKDD Workshop on Mining Data Semantics 2012, MDS'12 - SIGKDD 2012 - Beijing, China
Duration: 12 Aug 2012 → 16 Aug 2012

Publication series

Name: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

Conference

Conference: ACM SIGKDD Workshop on Mining Data Semantics 2012, MDS'12 - SIGKDD 2012
Country/Territory: China
City: Beijing
Period: 12/08/12 → 16/08/12

Keywords

  • Information extraction
  • Learning framework
  • Structured data
