Two-stage Web record extraction

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Peer-reviewed

Abstract

Extracting structured data from the Web is a challenging subtask of information extraction. On the Web, structured data are usually presented as lists, records, or tables. Existing methods must identify the boundaries of data regions before separating them into records. Because records do not always contain the same number of items or occur in consecutive sections, these methods often fail on complicated or noisy pages. In this paper, we propose a fully automatic method called Two-Stage Web Record Extraction (TSWRE) to extract records from an open-domain corpus. The approach uses a bottom-up analysis that starts from visually similar attribute sequences. It first identifies attribute sequences based on distinct tag paths of the document's ordered DOM tree. It then exploits the position-interleaving characteristics of the attribute sequences to estimate how likely the sequences are to belong to the same records. Empirical experiments show that our method achieves promising performance compared to existing methods and scales to a large corpus.
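The two stages named in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not give the scoring function or the visual-similarity test, so the switch-ratio score, the class `TagPathCollector`, and the functions `collect_sequences` and `interleave_score` are all hypothetical names chosen for illustration. The sketch assumes an attribute sequence is simply the list of document-order positions of text nodes that share a tag path, and that two sequences likely belong to the same records when their positions alternate.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Void elements never get a closing tag, so they are kept off the path stack.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class TagPathCollector(HTMLParser):
    """Stage 1 (assumed form): group text-node positions by tag path."""

    def __init__(self):
        super().__init__()
        self.stack = []                      # open tags from root to current node
        self.position = 0                    # document-order index of text nodes
        self.sequences = defaultdict(list)   # tag path -> list of positions

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        if data.strip():                     # ignore whitespace-only text
            self.sequences["/".join(self.stack)].append(self.position)
            self.position += 1


def collect_sequences(html):
    """Return {tag path: positions of text nodes found under that path}."""
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    return dict(collector.sequences)


def interleave_score(a, b):
    """Stage 2 (assumed form): share of adjacent pairs, in merged document
    order, that come from different sequences. 1.0 means strict alternation."""
    merged = sorted([(p, 0) for p in a] + [(p, 1) for p in b])
    if len(merged) < 2:
        return 0.0
    switches = sum(1 for x, y in zip(merged, merged[1:]) if x[1] != y[1])
    return switches / (len(merged) - 1)
```

On a list whose items each contain a `<b>` name followed by an `<i>` value, the two tag paths `ul/li/b` and `ul/li/i` yield alternating positions and a score of 1.0, whereas either of them paired with an unrelated footer paragraph scores much lower. A real system would also need the visual-similarity check and a threshold for grouping sequences, neither of which is specified in the abstract.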

Original language: English
Title of host publication: Proceedings of the 8th International Conference on Computer Science and Education, ICCSE 2013
Pages: 783-788
Number of pages: 6
DOI
Publication status: Published - 2013
Event: 8th International Conference on Computer Science and Education, ICCSE 2013 - Colombo, Sri Lanka
Duration: 26 Aug 2013 to 28 Aug 2013

Publication series

Name: Proceedings of the 8th International Conference on Computer Science and Education, ICCSE 2013

Conference

Conference: 8th International Conference on Computer Science and Education, ICCSE 2013
Country/Territory: Sri Lanka
City: Colombo
Period: 26/08/13 to 28/08/13
