TY - GEN
T1 - Deep web databases sampling approach based on probability selection and rule mining
AU - Xu, Yang
AU - Wang, Shu Liang
AU - Tian, Jian Wei
PY - 2009
Y1 - 2009
AB - A large portion of the data on the Web lies in the hidden databases of the Deep Web. These databases can only be accessed through their query interfaces, so information about their contents can only be obtained by data sampling. An efficient and uniform data sampling approach is important to other research, such as data source selection and ranking, because the samples give insight into the quality, freshness, and coverage of the data in the databases. However, existing hidden database samplers are very inefficient, because many queries are wasted in the sampling walks. In this paper, we propose a sampling approach based on probability selection and rule mining to solve this problem. First, we use historical valid walks to calculate the valid probability of each attribute value. Based on these probabilities, the sampler gives priority to the attribute values with the largest valid probability, which guides it to a valid sampling path earlier. Meanwhile, we save the underflow walk paths and mine underflow rules from them; these rules are used during sampling to steer the sampler away from underflow walks. The experimental results indicate that our approach improves sampling efficiency by detecting valid paths earlier and avoiding many underflow queries.
KW - Data sampling
KW - Deep web
KW - Probability selection
KW - Rule mining
UR - http://www.scopus.com/inward/record.url?scp=77949678821&partnerID=8YFLogxK
U2 - 10.1109/CISE.2009.5362897
DO - 10.1109/CISE.2009.5362897
M3 - Conference contribution
AN - SCOPUS:77949678821
SN - 9781424445073
T3 - Proceedings - 2009 International Conference on Computational Intelligence and Software Engineering, CiSE 2009
BT - Proceedings - 2009 International Conference on Computational Intelligence and Software Engineering, CiSE 2009
T2 - 2009 International Conference on Computational Intelligence and Software Engineering, CiSE 2009
Y2 - 11 December 2009 through 13 December 2009
ER -