Manually detecting errors for data cleaning using adaptive crowdsourcing strategies

Haojun Zhang, Chengliang Chai, An Hai Doan, Paraschos Koutris, Esteban Arcaute

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; peer-reviewed

3 Citations (Scopus)

Abstract

Current work to detect data errors often uses (semi-)automatic solutions. In this paper, however, we argue that there are many real-world scenarios where users have to detect data errors completely manually, and that more attention should be devoted to this problem. We then study one instance of this problem in depth. Specifically, we focus on the problem of manually verifying the values of a target attribute, and show that the current best solution in industry, which uses crowdsourcing, has significant limitations. We develop a new solution that addresses these limitations. Our solution can find a much more accurate ranking of the data values in terms of their difficulty for crowdsourcing, can help domain experts debug this ranking, and can handle ambiguous values for which no golden answers exist. Importantly, our solution provides a unified framework that allows users to easily express and solve a broad range of optimization problems for crowdsourcing, to balance cost and accuracy. Finally, we describe extensive experiments with three real-world data sets that demonstrate the utility and promise of our solution.

Original language: English
Title of host publication: Advances in Database Technology - EDBT 2020
Subtitle of host publication: 23rd International Conference on Extending Database Technology, Proceedings
Editors: Angela Bonifati, Yongluan Zhou, Marcos Antonio Vaz Salles, Alexander Bohm, Dan Olteanu, George Fletcher, Arijit Khan, Bin Yang
Publisher: OpenProceedings.org
Pages: 311-322
Number of pages: 12
ISBN (Electronic): 9783893180837
DOI: https://doi.org/10.5441/002/edbt.2020.28
Publication status: Published - 2020
Externally published
Event: 23rd International Conference on Extending Database Technology, EDBT 2020 - Copenhagen, Denmark
Duration: 30 Mar 2020 to 2 Apr 2020

Publication series

Name: Advances in Database Technology - EDBT
Volume: 2020-March
ISSN (Electronic): 2367-2005

Conference

Conference: 23rd International Conference on Extending Database Technology, EDBT 2020
Country/Territory: Denmark
City: Copenhagen
Period: 30/03/20 to 2/04/20

Cite this

Zhang, H., Chai, C., Doan, A. H., Koutris, P., & Arcaute, E. (2020). Manually detecting errors for data cleaning using adaptive crowdsourcing strategies. In A. Bonifati, Y. Zhou, M. A. Vaz Salles, A. Bohm, D. Olteanu, G. Fletcher, A. Khan, & B. Yang (Eds.), Advances in Database Technology - EDBT 2020: 23rd International Conference on Extending Database Technology, Proceedings (pp. 311-322). (Advances in Database Technology - EDBT; Vol. 2020-March). OpenProceedings.org. https://doi.org/10.5441/002/edbt.2020.28