TY - GEN
T1 - Two-stage Web record extraction
AU - Yang, Qing
AU - Zhang, Chunxia
AU - Niu, Zhendong
PY - 2013
Y1 - 2013
N2 - Extracting structured data from the Web is a challenging subtask of information extraction. On the Web, structured data are usually presented as lists, records, or tables. Existing methods must identify the boundaries of data regions before separating them into records. Because records do not always have the same number of items or occur in consecutive sections, these methods often fail on complicated or noisy pages. In this paper, we propose a fully automatic method called Two-Stage Web Record Extraction (TSWRE) to extract records from an open-domain corpus. The approach uses a bottom-up analysis that starts from visually similar attribute sequences. It first identifies attribute sequences based on distinct tag paths of the ordered DOM tree of the document. The method then exploits the position-interleaving characteristics of the attribute sequences to estimate how likely the sequences are to belong to the same records. Experiments show that our method achieves promising performance compared with existing methods and scales to a large corpus.
AB - Extracting structured data from the Web is a challenging subtask of information extraction. On the Web, structured data are usually presented as lists, records, or tables. Existing methods must identify the boundaries of data regions before separating them into records. Because records do not always have the same number of items or occur in consecutive sections, these methods often fail on complicated or noisy pages. In this paper, we propose a fully automatic method called Two-Stage Web Record Extraction (TSWRE) to extract records from an open-domain corpus. The approach uses a bottom-up analysis that starts from visually similar attribute sequences. It first identifies attribute sequences based on distinct tag paths of the ordered DOM tree of the document. The method then exploits the position-interleaving characteristics of the attribute sequences to estimate how likely the sequences are to belong to the same records. Experiments show that our method achieves promising performance compared with existing methods and scales to a large corpus.
KW - Information extraction
KW - Data record extraction
KW - Multiple sequence alignment
UR - http://www.scopus.com/inward/record.url?scp=84881528256&partnerID=8YFLogxK
U2 - 10.1109/ICCSE.2013.6554015
DO - 10.1109/ICCSE.2013.6554015
M3 - Conference contribution
AN - SCOPUS:84881528256
SN - 9781467344623
T3 - Proceedings of the 8th International Conference on Computer Science and Education, ICCSE 2013
SP - 783
EP - 788
BT - Proceedings of the 8th International Conference on Computer Science and Education, ICCSE 2013
T2 - 8th International Conference on Computer Science and Education, ICCSE 2013
Y2 - 26 August 2013 through 28 August 2013
ER -