Two-stage Web record extraction

Qing Yang; Chunxia Zhang; Zhendong Niu

doi:10.1109/ICCSE.2013.6554015

School of Computer Science

Beijing Institute of Technology

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Extracting structured data from the Web is a challenging subtask of information extraction. On the Web, structured data are usually presented as lists, records or tables. Existing methods require identifying the boundaries of data regions before separating them into records. Because records do not always have the same number of items or occur in consecutive sections, these methods often fail to handle such complicated or noisy pages. In this paper, we propose a fully automatic method called Two-Stage Web Record Extraction (TSWRE) to extract records from an open domain corpus. This approach uses a bottom-up analysis that starts with sequences of visually similar attribute sequences. It first identifies attribute sequences based on distinct tag paths of the ordered DOM tree of the document. The method exploits the position interleave characteristics of the attribute sequences to estimate how likely the sequences are to belong to the same records. Empirical experiments show that our method achieves promising performance compared to existing methods and is scalable to a large corpus.
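
The two ideas named in the abstract (grouping content by DOM tag path, then checking whether two groups interleave in document order) can be illustrated with a minimal Python sketch. This is not the TSWRE algorithm itself: the class `TagPathCollector`, the function `interleave_score`, the scoring formula (fraction of adjacent items in merged document order that come from different sequences), and the sample HTML are all assumptions made for illustration, and the sketch ignores the visual-similarity cues the paper mentions.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Tags that never take a closing tag; skipped so they do not pollute the path.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class TagPathCollector(HTMLParser):
    """Group text nodes by root-to-node tag path, recording document order."""

    def __init__(self):
        super().__init__()
        self.stack = []                     # currently open tags
        self.position = 0                   # running index of non-empty text nodes
        self.sequences = defaultdict(list)  # tag path -> [(position, text)]

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerate unclosed inner tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            path = "/".join(self.stack)
            self.sequences[path].append((self.position, text))
            self.position += 1


def interleave_score(seq_a, seq_b):
    """Hypothetical measure: share of adjacent pairs, in merged document
    order, whose items come from different sequences (1.0 = strict
    alternation)."""
    merged = sorted(
        [(pos, "a") for pos, _ in seq_a] + [(pos, "b") for pos, _ in seq_b]
    )
    if len(merged) < 2:
        return 0.0
    switches = sum(1 for x, y in zip(merged, merged[1:]) if x[1] != y[1])
    return switches / (len(merged) - 1)


# Made-up page: three records with a name and a value, plus unrelated footers.
HTML = """
<ul>
  <li><b>Alpha</b><i>10</i></li>
  <li><b>Beta</b><i>20</i></li>
  <li><b>Gamma</b><i>30</i></li>
</ul>
<p>Footer one</p><p>Footer two</p>
"""

collector = TagPathCollector()
collector.feed(HTML)
sequences = dict(collector.sequences)

names, values, footers = sequences["ul/li/b"], sequences["ul/li/i"], sequences["p"]
print(interleave_score(names, values))   # sequences that alternate per record
print(interleave_score(names, footers))  # sequences that do not alternate
```

In this toy page the name and value sequences alternate strictly, so they score higher than the name and footer sequences, which is the kind of signal the abstract describes for deciding whether attribute sequences belong to the same records.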

Original language	English
Title of host publication	Proceedings of the 8th International Conference on Computer Science and Education, ICCSE 2013
Pages	783-788
Number of pages	6
DOI	https://doi.org/10.1109/ICCSE.2013.6554015
Publication status	Published - 2013
Event	8th International Conference on Computer Science and Education, ICCSE 2013 - Colombo, Sri Lanka. Duration: 26 Aug 2013 → 28 Aug 2013

Publication series

Name	Proceedings of the 8th International Conference on Computer Science and Education, ICCSE 2013

Conference

Conference	8th International Conference on Computer Science and Education, ICCSE 2013
Country/Territory	Sri Lanka
City	Colombo
Period	26/08/13 → 28/08/13

Cite this

@inproceedings{b229a50e1f9d459cbef55bfb7b5a4a2e,

title = "Two-stage Web record extraction",

abstract = "Extracting structured data from the Web is a challenging subtask of information extraction. On the Web, structured data are usually presented as lists, records or tables. Existing methods require identifying the boundaries of data regions before separating them into records. Because records do not always have the same number of items or occur in consecutive sections, these methods often fail to handle such complicated or noisy pages. In this paper, we propose a fully automatic method called Two-Stage Web Record Extraction (TSWRE) to extract records from an open domain corpus. This approach uses a bottom-up analysis that starts with sequences of visually similar attribute sequences. It first identifies attribute sequences based on distinct tag paths of the ordered DOM tree of the document. The method exploits the position interleave characteristics of the attribute sequences to estimate how likely the sequences are to belong to the same records. Empirical experiments show that our method achieves promising performance compared to existing methods and is scalable to a large corpus.",

keywords = "Information extraction, data record extraction, multiple sequence alignment",

author = "Qing Yang and Chunxia Zhang and Zhendong Niu",

year = "2013",

doi = "10.1109/ICCSE.2013.6554015",

language = "English",

isbn = "9781467344623",

series = "Proceedings of the 8th International Conference on Computer Science and Education, ICCSE 2013",

pages = "783--788",

booktitle = "Proceedings of the 8th International Conference on Computer Science and Education, ICCSE 2013",

note = "8th International Conference on Computer Science and Education, ICCSE 2013 ; Conference date: 26-08-2013 Through 28-08-2013",

}

Yang, Q, Zhang, C & Niu, Z 2013, Two-stage Web record extraction. in Proceedings of the 8th International Conference on Computer Science and Education, ICCSE 2013., 6554015, Proceedings of the 8th International Conference on Computer Science and Education, ICCSE 2013, pp. 783-788, 8th International Conference on Computer Science and Education, ICCSE 2013, Colombo, Sri Lanka, 26/08/13. https://doi.org/10.1109/ICCSE.2013.6554015

Two-stage Web record extraction. / Yang, Qing; Zhang, Chunxia ; Niu, Zhendong.
Proceedings of the 8th International Conference on Computer Science and Education, ICCSE 2013. 2013. p. 783-788 6554015 (Proceedings of the 8th International Conference on Computer Science and Education, ICCSE 2013).

TY - GEN

T1 - Two-stage Web record extraction

AU - Yang, Qing

AU - Zhang, Chunxia

AU - Niu, Zhendong

PY - 2013

Y1 - 2013

N2 - Extracting structured data from the Web is a challenging subtask of information extraction. On the Web, structured data are usually presented as lists, records or tables. Existing methods require identifying the boundaries of data regions before separating them into records. Because records do not always have the same number of items or occur in consecutive sections, these methods often fail to handle such complicated or noisy pages. In this paper, we propose a fully automatic method called Two-Stage Web Record Extraction (TSWRE) to extract records from an open domain corpus. This approach uses a bottom-up analysis that starts with sequences of visually similar attribute sequences. It first identifies attribute sequences based on distinct tag paths of the ordered DOM tree of the document. The method exploits the position interleave characteristics of the attribute sequences to estimate how likely the sequences are to belong to the same records. Empirical experiments show that our method achieves promising performance compared to existing methods and is scalable to a large corpus.

AB - Extracting structured data from the Web is a challenging subtask of information extraction. On the Web, structured data are usually presented as lists, records or tables. Existing methods require identifying the boundaries of data regions before separating them into records. Because records do not always have the same number of items or occur in consecutive sections, these methods often fail to handle such complicated or noisy pages. In this paper, we propose a fully automatic method called Two-Stage Web Record Extraction (TSWRE) to extract records from an open domain corpus. This approach uses a bottom-up analysis that starts with sequences of visually similar attribute sequences. It first identifies attribute sequences based on distinct tag paths of the ordered DOM tree of the document. The method exploits the position interleave characteristics of the attribute sequences to estimate how likely the sequences are to belong to the same records. Empirical experiments show that our method achieves promising performance compared to existing methods and is scalable to a large corpus.

KW - Information extraction

KW - data record extraction

KW - multiple sequence alignment

UR - http://www.scopus.com/inward/record.url?scp=84881528256&partnerID=8YFLogxK

U2 - 10.1109/ICCSE.2013.6554015

DO - 10.1109/ICCSE.2013.6554015

M3 - Conference contribution

AN - SCOPUS:84881528256

SN - 9781467344623

T3 - Proceedings of the 8th International Conference on Computer Science and Education, ICCSE 2013

SP - 783

EP - 788

BT - Proceedings of the 8th International Conference on Computer Science and Education, ICCSE 2013

T2 - 8th International Conference on Computer Science and Education, ICCSE 2013

Y2 - 26 August 2013 through 28 August 2013

ER -
