Two-stage Web record extraction

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Extracting structured data from the Web is a challenging subtask of information extraction. On the Web, structured data are usually presented as lists, records, or tables. Existing methods must identify the boundaries of data regions before separating them into records. Because records do not always contain the same number of items or occur in consecutive sections, these methods often fail on complicated or noisy pages. In this paper, we propose a fully automatic method called Two-Stage Web Record Extraction (TSWRE) to extract records from an open-domain corpus. The approach uses a bottom-up analysis that starts with visually similar attribute sequences. It first identifies attribute sequences based on distinct tag paths in the ordered DOM tree of the document. It then exploits the positional interleaving of the attribute sequences to estimate how likely they are to belong to the same records. Experiments show that our method achieves promising performance compared with existing methods and scales to a large corpus.
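The two ideas named in the abstract, grouping values by tag path and checking how their positions interleave, can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not give the scoring function, so `interleave_score` below is a hypothetical stand-in (the fraction of adjacent text nodes, in document order, that switch between the two sequences), and the names `TagPathCollector` and `tag_path_sequences` are likewise assumptions made for illustration.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TagPathCollector(HTMLParser):
    """Group text nodes by their root-to-leaf tag path, in document order."""

    def __init__(self):
        super().__init__()
        self.stack = []                      # currently open tags
        self.sequences = defaultdict(list)   # tag path -> [(position, text)]
        self.position = 0                    # running index of text nodes

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the innermost matching open tag; tolerates unclosed children.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.sequences["/".join(self.stack)].append((self.position, text))
            self.position += 1


def tag_path_sequences(html):
    """Return {tag path: [(position, text), ...]} for one HTML document."""
    collector = TagPathCollector()
    collector.feed(html)
    return dict(collector.sequences)


def interleave_score(pos_a, pos_b):
    """Hypothetical measure: share of adjacent nodes that alternate sequences."""
    merged = sorted([(p, "a") for p in pos_a] + [(p, "b") for p in pos_b])
    labels = [label for _, label in merged]
    if len(labels) < 2:
        return 0.0
    switches = sum(x != y for x, y in zip(labels, labels[1:]))
    return switches / (len(labels) - 1)


html = """
<ul>
  <li><b>Widget</b><span>$5</span></li>
  <li><b>Gadget</b><span>$7</span></li>
  <li><b>Gizmo</b><span>$9</span></li>
</ul>
<p>Footer note</p>
"""

seqs = tag_path_sequences(html)
names = [p for p, _ in seqs["ul/li/b"]]      # [0, 2, 4]
prices = [p for p, _ in seqs["ul/li/span"]]  # [1, 3, 5]
footer = [p for p, _ in seqs["p"]]           # [6]

print(interleave_score(names, prices))            # 1.0
print(round(interleave_score(names, footer), 2))  # 0.33
```

In this toy page the name and price sequences alternate perfectly, which suggests they are attributes of the same records, while the footer text does not interleave with either. A real page would need a more robust score and a way to handle missing attributes.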

Original language: English
Title of host publication: Proceedings of the 8th International Conference on Computer Science and Education, ICCSE 2013
Pages: 783-788
Number of pages: 6
Publication status: Published - 2013
Event: 8th International Conference on Computer Science and Education, ICCSE 2013 - Colombo, Sri Lanka
Duration: 26 Aug 2013 to 28 Aug 2013

Keywords

  • Information extraction
  • Data record extraction
  • Multiple sequence alignment
