Deep web databases sampling approach based on probability selection and rule mining

Yang Xu; Shu Liang Wang; Jian Wei Tian

doi:10.1109/CISE.2009.5362897

Yang Xu^*, Shu Liang Wang, Jian Wei Tian

^*Corresponding author for this work

Wuhan University

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

1 Citation (Scopus)

Abstract

A large portion of the data on the Web lies in the hidden databases of the Deep Web. These databases can be accessed only through their query interfaces, so information about their contents can be obtained only by data sampling. An efficient and uniform data sampling approach is important to other research, such as data source selection and ranking, because the samples give insight into the quality, freshness, and coverage of the data in the databases. However, existing hidden database samplers are inefficient because many queries are wasted during the sampling walks. In this paper, we propose a sampling approach based on probability selection and rule mining to solve this problem. First, we use historical valid walks to calculate the valid probability of each attribute value. Based on the valid probability, we give priority to the attribute values with the largest valid probability, guiding the sampler to find a valid sampling path earlier. Meanwhile, we save the underflow walk paths and mine underflow rules from them; these rules are used during sampling to steer the sampler away from underflow walks. The experimental results indicate that our approach improves sampling efficiency by detecting valid paths earlier and avoiding many underflow queries.
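The two ideas in the abstract (ordering attribute values by how often they appeared in past valid walks, and remembering underflow paths as rules) can be illustrated with a small sketch. Everything here is an assumption made for illustration, not the authors' implementation: the class `GuidedSampler`, the helper `classify`, the toy table `db`, and the top-k limit `k` are hypothetical. The sketch also assumes the usual hidden-database convention that a query returning no rows is an underflow, one returning between 1 and k rows is valid, and one returning more than k rows overflows. Value selection is greedy and deterministic, and a "rule" is simply a stored underflow path, so this shows the query-saving mechanism only and makes no claim about sample uniformity.

```python
from collections import defaultdict


def classify(db, query, k):
    """Simulate a top-k query interface: 'underflow', 'valid', or 'overflow'."""
    n = sum(all(row[a] == v for a, v in query.items()) for row in db)
    if n == 0:
        return "underflow"
    return "valid" if n <= k else "overflow"


class GuidedSampler:
    """Hypothetical drill-down sampler guided by history and underflow rules."""

    def __init__(self, db, domains, k):
        self.db, self.domains, self.k = db, domains, k
        self.valid_hits = defaultdict(int)  # (attr, value) -> count in valid walks
        self.rules = set()                  # frozensets of (attr, value) that underflowed
        self.queries = 0                    # queries actually sent to the interface

    def _known_underflow(self, query):
        # A query underflows if it contains any stored underflow path.
        items = frozenset(query.items())
        return any(rule <= items for rule in self.rules)

    def walk(self):
        query = {}
        for attr, values in self.domains.items():
            # Try the values seen most often in earlier valid walks first.
            ordered = sorted(values, key=lambda v: -self.valid_hits[(attr, v)])
            for v in ordered:
                candidate = {**query, attr: v}
                if self._known_underflow(candidate):
                    continue  # pruned by a rule, no query issued
                self.queries += 1
                status = classify(self.db, candidate, self.k)
                if status == "underflow":
                    self.rules.add(frozenset(candidate.items()))
                    continue
                query = candidate
                if status == "valid":
                    for item in query.items():
                        self.valid_hits[item] += 1
                    return query
                break  # overflow: refine with the next attribute
            else:
                return None  # every value at this level underflowed
        return None  # attributes exhausted while still overflowing


# Toy data, invented for the example.
db = [
    {"make": "a", "color": "red"},
    {"make": "a", "color": "blue"},
    {"make": "b", "color": "blue"},
    {"make": "b", "color": "blue"},
]
domains = {"make": ["c", "a", "b"], "color": ["green", "red", "blue"]}

sampler = GuidedSampler(db, domains, k=1)
first = sampler.walk()
cost_first = sampler.queries
second = sampler.walk()
cost_second = sampler.queries - cost_first
print(first, cost_first, cost_second)
```

On this toy table the second walk issues fewer queries than the first, because the values that previously led to a valid query are tried first. A real sampler would also need randomized selection and a correction step to keep the samples close to uniform, which the abstract does not detail.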

Original language	English
Title of host publication	Proceedings - 2009 International Conference on Computational Intelligence and Software Engineering, CiSE 2009
DOI	https://doi.org/10.1109/CISE.2009.5362897
Publication status	Published - 2009
Externally published	Yes
Event	2009 International Conference on Computational Intelligence and Software Engineering, CiSE 2009 - Wuhan, China; Duration: 11 Dec 2009 → 13 Dec 2009

Publication series

Name	Proceedings - 2009 International Conference on Computational Intelligence and Software Engineering, CiSE 2009

Conference

Conference	2009 International Conference on Computational Intelligence and Software Engineering, CiSE 2009
Country/Territory	China
City	Wuhan
Period	11/12/09 → 13/12/09

Access to document

10.1109/CISE.2009.5362897

Cite this

Xu, Y., Wang, S. L., & Tian, J. W. (2009). Deep web databases sampling approach based on probability selection and rule mining. In Proceedings - 2009 International Conference on Computational Intelligence and Software Engineering, CiSE 2009 Article 5362897 (Proceedings - 2009 International Conference on Computational Intelligence and Software Engineering, CiSE 2009). https://doi.org/10.1109/CISE.2009.5362897

@inproceedings{f15ddd4117194efba76b7d01dd0a27ba,

title = "Deep web databases sampling approach based on probability selection and rule mining",

abstract = "A great portion of data on the Web lies in the hidden databases of the Deep Web. These databases can only be accessed through the query interfaces. The data information in these databases can only be obtained by data sampling. Efficient and uniform data sampling approach is very important to other research work, such as data source selection and ranking, for the data samples can give insight into the data quality, freshness and coverage information in the databases. However, the existing hidden database samplers are very inefficient, because lots of queries are wasted in the sampling walks. In this paper, we propose a probability selection and rule mining based sampling approach to solve this problem. First, we leverage the historical valid walks to calculate the valid probability of the attribute values. Based on the valid probability, we give priority to sample using the attribute values with largest valid probability and guide the sampler to find the valid sampling path earlier. Meanwhile, we save the underflow walk path to mine the underflow rules, which are used in the sampling process to guide the sampler to avoid the underflow walks. The experimental results indicate that our approach can improve the sampling efficiency by detecting the valid path earlier and avoid many underflow queries.",

keywords = "Data sampling, Deep web, Probability selection, Rule mining",

author = "Xu, Yang and Wang, {Shu Liang} and Tian, {Jian Wei}",

year = "2009",

doi = "10.1109/CISE.2009.5362897",

language = "English",

isbn = "9781424445073",

series = "Proceedings - 2009 International Conference on Computational Intelligence and Software Engineering, CiSE 2009",

booktitle = "Proceedings - 2009 International Conference on Computational Intelligence and Software Engineering, CiSE 2009",

note = "2009 International Conference on Computational Intelligence and Software Engineering, CiSE 2009 ; Conference date: 11-12-2009 Through 13-12-2009",

}

Xu, Y, Wang, SL & Tian, JW 2009, Deep web databases sampling approach based on probability selection and rule mining. in Proceedings - 2009 International Conference on Computational Intelligence and Software Engineering, CiSE 2009., 5362897, Proceedings - 2009 International Conference on Computational Intelligence and Software Engineering, CiSE 2009, 2009 International Conference on Computational Intelligence and Software Engineering, CiSE 2009, Wuhan, China, 11/12/09. https://doi.org/10.1109/CISE.2009.5362897

Deep web databases sampling approach based on probability selection and rule mining. / Xu, Yang; Wang, Shu Liang; Tian, Jian Wei.
Proceedings - 2009 International Conference on Computational Intelligence and Software Engineering, CiSE 2009. 2009. 5362897 (Proceedings - 2009 International Conference on Computational Intelligence and Software Engineering, CiSE 2009).

TY - GEN

T1 - Deep web databases sampling approach based on probability selection and rule mining

AU - Xu, Yang

AU - Wang, Shu Liang

AU - Tian, Jian Wei

PY - 2009

Y1 - 2009

N2 - A great portion of data on the Web lies in the hidden databases of the Deep Web. These databases can only be accessed through the query interfaces. The data information in these databases can only be obtained by data sampling. Efficient and uniform data sampling approach is very important to other research work, such as data source selection and ranking, for the data samples can give insight into the data quality, freshness and coverage information in the databases. However, the existing hidden database samplers are very inefficient, because lots of queries are wasted in the sampling walks. In this paper, we propose a probability selection and rule mining based sampling approach to solve this problem. First, we leverage the historical valid walks to calculate the valid probability of the attribute values. Based on the valid probability, we give priority to sample using the attribute values with largest valid probability and guide the sampler to find the valid sampling path earlier. Meanwhile, we save the underflow walk path to mine the underflow rules, which are used in the sampling process to guide the sampler to avoid the underflow walks. The experimental results indicate that our approach can improve the sampling efficiency by detecting the valid path earlier and avoid many underflow queries.

AB - A great portion of data on the Web lies in the hidden databases of the Deep Web. These databases can only be accessed through the query interfaces. The data information in these databases can only be obtained by data sampling. Efficient and uniform data sampling approach is very important to other research work, such as data source selection and ranking, for the data samples can give insight into the data quality, freshness and coverage information in the databases. However, the existing hidden database samplers are very inefficient, because lots of queries are wasted in the sampling walks. In this paper, we propose a probability selection and rule mining based sampling approach to solve this problem. First, we leverage the historical valid walks to calculate the valid probability of the attribute values. Based on the valid probability, we give priority to sample using the attribute values with largest valid probability and guide the sampler to find the valid sampling path earlier. Meanwhile, we save the underflow walk path to mine the underflow rules, which are used in the sampling process to guide the sampler to avoid the underflow walks. The experimental results indicate that our approach can improve the sampling efficiency by detecting the valid path earlier and avoid many underflow queries.

KW - Data sampling

KW - Deep web

KW - Probability selection

KW - Rule mining

UR - http://www.scopus.com/inward/record.url?scp=77949678821&partnerID=8YFLogxK

U2 - 10.1109/CISE.2009.5362897

DO - 10.1109/CISE.2009.5362897

M3 - Conference contribution

AN - SCOPUS:77949678821

SN - 9781424445073

T3 - Proceedings - 2009 International Conference on Computational Intelligence and Software Engineering, CiSE 2009

BT - Proceedings - 2009 International Conference on Computational Intelligence and Software Engineering, CiSE 2009

T2 - 2009 International Conference on Computational Intelligence and Software Engineering, CiSE 2009

Y2 - 11 December 2009 through 13 December 2009

ER -

Xu Y, Wang SL, Tian JW. Deep web databases sampling approach based on probability selection and rule mining. In Proceedings - 2009 International Conference on Computational Intelligence and Software Engineering, CiSE 2009. 2009. 5362897. (Proceedings - 2009 International Conference on Computational Intelligence and Software Engineering, CiSE 2009). doi: 10.1109/CISE.2009.5362897
