Deep web databases sampling approach based on probability selection and rule mining

Yang Xu*, Shu Liang Wang, Jian Wei Tian

*Corresponding author for this work

Research output: Chapter in book/report/conference proceeding, conference contribution, peer-reviewed

1 citation (Scopus)

Abstract

A large portion of the data on the Web lies in the hidden databases of the Deep Web. These databases can be accessed only through their query interfaces, so information about their contents can be obtained only by data sampling. An efficient and uniform sampling approach is important to other research, such as data source selection and ranking, because data samples give insight into the quality, freshness, and coverage of the data in the databases. However, existing hidden database samplers are inefficient because many queries are wasted during the sampling walks. In this paper, we propose a sampling approach based on probability selection and rule mining to solve this problem. First, we use the historical valid walks to calculate the valid probability of each attribute value. Based on this probability, we give priority to the attribute values with the largest valid probability, which guides the sampler to a valid sampling path earlier. Meanwhile, we save the underflow walk paths and mine underflow rules from them; these rules are used during sampling to steer the sampler away from underflow walks. The experimental results indicate that our approach improves sampling efficiency by detecting valid paths earlier and avoiding many underflow queries.
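The two ideas in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the in-memory `db`, the top-k limit `K`, the `query` function, the `Sampler` class, and all attribute names are hypothetical, and the weighted ordering of attribute values (count plus one, so untried values stay reachable) is an assumption standing in for the paper's valid-probability calculation. The underflow rules rely only on the fact that adding predicates to a conjunctive query that already returned nothing cannot produce results.

```python
import random
from collections import defaultdict

K = 2  # assumed top-k limit of the simulated query interface


def query(db, preds, k=K):
    """Simulated interface: classify a conjunctive query by its result size."""
    hits = [row for row in db if all(row[a] == v for a, v in preds.items())]
    if not hits:
        return "underflow", []
    if len(hits) > k:
        return "overflow", []  # too many matches; the interface hides them
    return "valid", hits


class Sampler:
    def __init__(self, db, domains, seed=0):
        self.db = db
        self.domains = domains  # attribute -> candidate values, in walk order
        self.rng = random.Random(seed)
        self.valid_counts = defaultdict(int)  # (attr, value) -> uses in valid walks
        self.underflow_rules = set()  # predicate sets known to return nothing
        self.queries_issued = 0

    def _blocked(self, preds):
        # A query that contains a known underflow predicate set must underflow too.
        items = frozenset(preds.items())
        return any(rule <= items for rule in self.underflow_rules)

    def _ranked_values(self, attr):
        # Weighted random ordering: values seen in more valid walks tend to come first.
        def key(value):
            weight = self.valid_counts[(attr, value)] + 1
            return self.rng.random() ** (1.0 / weight)
        return sorted(self.domains[attr], key=key, reverse=True)

    def walk(self):
        preds = {}
        for attr in self.domains:
            for value in self._ranked_values(attr):
                trial = {**preds, attr: value}
                if self._blocked(trial):
                    continue  # skip without spending a query
                self.queries_issued += 1
                status, hits = query(self.db, trial)
                if status == "underflow":
                    self.underflow_rules.add(frozenset(trial.items()))
                    continue
                preds = trial
                if status == "valid":
                    for pair in preds.items():
                        self.valid_counts[pair] += 1
                    return self.rng.choice(hits)
                break  # overflow: narrow the query with the next attribute
            else:
                return None  # every value underflowed or was blocked
        return None  # still overflowing after all attributes were used


# Toy data, purely illustrative.
db = [
    {"make": "a", "color": "red"},
    {"make": "a", "color": "blue"},
    {"make": "a", "color": "green"},
    {"make": "b", "color": "red"},
]
domains = {"make": ["a", "b", "c"], "color": ["red", "blue", "green"]}

sampler = Sampler(db, domains)
samples = [sampler.walk() for _ in range(20)]
print(len(samples), "walks,", sampler.queries_issued, "queries issued")
```

Favouring previously successful values skews the sample toward them, so a sampler that needs uniform samples would have to correct for that skew; this sketch makes no attempt to do so.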

Original language: English
Title of host publication: Proceedings - 2009 International Conference on Computational Intelligence and Software Engineering, CiSE 2009
Publication status: Published - 2009
Externally published
Event: 2009 International Conference on Computational Intelligence and Software Engineering, CiSE 2009 - Wuhan, China
Duration: 11 Dec 2009 to 13 Dec 2009

Publication series

Name: Proceedings - 2009 International Conference on Computational Intelligence and Software Engineering, CiSE 2009

Conference

Conference: 2009 International Conference on Computational Intelligence and Software Engineering, CiSE 2009
Country/Territory: China
City: Wuhan
Period: 11/12/09 to 13/12/09
