Deep web databases sampling approach based on probability selection and rule mining

Yang Xu*, Shu Liang Wang, Jian Wei Tian

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

1 Citation (Scopus)

Abstract

A large portion of the data on the Web lies in the hidden databases of the Deep Web. These databases can be accessed only through their query interfaces, so information about their contents can be obtained only by data sampling. An efficient and uniform data sampling approach is very important to other research work, such as data source selection and ranking, because the data samples give insight into the quality, freshness and coverage of the data in the databases. However, existing hidden database samplers are very inefficient, because many queries are wasted in the sampling walks. In this paper, we propose a sampling approach based on probability selection and rule mining to solve this problem. First, we leverage the historical valid walks to calculate the valid probability of the attribute values. Based on the valid probability, we give priority to sampling with the attribute values that have the largest valid probability, which guides the sampler to find a valid sampling path earlier. Meanwhile, we save the underflow walk paths to mine underflow rules, which are used in the sampling process to guide the sampler away from underflow walks. The experimental results indicate that our approach improves sampling efficiency by detecting valid paths earlier and avoiding many underflow queries.
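The two ideas in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy table `ROWS`, the attribute domains, the result limit `K`, the smoothed probability estimate and the weighted random choice of values are all assumptions made for illustration. The only property relied on is that a conjunctive query extending an empty (underflow) query is itself empty, so a saved underflow path can be used to skip later queries.

```python
import random
from collections import defaultdict

# Toy stand-in for a hidden database: rows are reachable only through query().
ROWS = [
    {"make": "audi", "color": "red", "fuel": "petrol"},
    {"make": "audi", "color": "blue", "fuel": "diesel"},
    {"make": "bmw", "color": "red", "fuel": "petrol"},
    {"make": "bmw", "color": "red", "fuel": "diesel"},
]
DOMAINS = {
    "make": ["audi", "bmw", "fiat"],
    "color": ["red", "blue"],
    "fuel": ["petrol", "diesel"],
}
K = 1  # assumed result limit of the query interface


def query(preds):
    """Return 'underflow' (no match), 'overflow' (more than K matches) or 'valid'."""
    matches = [r for r in ROWS if all(r[a] == v for a, v in preds.items())]
    if not matches:
        return "underflow", []
    if len(matches) > K:
        return "overflow", []
    return "valid", matches


class GuidedSampler:
    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.valid = defaultdict(int)  # (attr, value) -> valid walks containing it
        self.used = defaultdict(int)   # (attr, value) -> finished walks containing it
        self.rules = set()             # frozensets of predicates known to underflow
        self.underflow_queries = 0

    def valid_probability(self, pair):
        # Smoothed estimate, so untried values keep a non-zero weight.
        return (self.valid[pair] + 1) / (self.used[pair] + 2)

    def pruned(self, preds):
        # Any query that contains a known underflow rule must also underflow.
        items = set(preds.items())
        return any(rule <= items for rule in self.rules)

    def finish(self, preds, ok):
        # Update the history counters for every predicate on the walk.
        for pair in preds.items():
            self.used[pair] += 1
            self.valid[pair] += int(ok)

    def walk(self):
        preds = {}
        for attr, values in DOMAINS.items():
            # Skip values that a mined rule already proves to be empty.
            options = [v for v in values if not self.pruned({**preds, attr: v})]
            if not options:
                return None
            weights = [self.valid_probability((attr, v)) for v in options]
            preds[attr] = self.rng.choices(options, weights=weights, k=1)[0]
            status, rows = query(preds)
            if status == "valid":
                self.finish(preds, ok=True)
                return self.rng.choice(rows)
            if status == "underflow":
                self.underflow_queries += 1
                self.rules.add(frozenset(preds.items()))
                self.finish(preds, ok=False)
                return None
        return None  # still overflowing after the last attribute


def run(n_walks=200, seed=0):
    sampler = GuidedSampler(seed)
    walks = (sampler.walk() for _ in range(n_walks))
    return sampler, [s for s in walks if s is not None]
```

Calling `run()` returns the sampler and the collected rows; in this sketch each distinct underflow path costs at most one query, after which the stored rule prunes it. Weighting values by past validity skews the sample toward those values, and the abstract does not say how uniformity is preserved, so the sketch makes no attempt to correct for it.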

Original language: English
Title of host publication: Proceedings - 2009 International Conference on Computational Intelligence and Software Engineering, CiSE 2009
Publication status: Published - 2009
Externally published: Yes
Event: 2009 International Conference on Computational Intelligence and Software Engineering, CiSE 2009 - Wuhan, China
Duration: 11 Dec 2009 → 13 Dec 2009

Publication series

Name: Proceedings - 2009 International Conference on Computational Intelligence and Software Engineering, CiSE 2009

Conference

Conference: 2009 International Conference on Computational Intelligence and Software Engineering, CiSE 2009
Country/Territory: China
City: Wuhan
Period: 11/12/09 → 13/12/09

Keywords

  • Data sampling
  • Deep web
  • Probability selection
  • Rule mining
