Manually detecting errors for data cleaning using adaptive crowdsourcing strategies

Haojun Zhang, Chengliang Chai, An Hai Doan, Paraschos Koutris, Esteban Arcaute

Research output: Chapter in Book/Report/Conference proceedingConference contributionpeer-review

3 Citations (Scopus)

Abstract

Current work to detect data errors often uses (semi-)automatic solutions. In this paper, however, we argue that there are many real-world scenarios where users have to detect data errors completely manually, and that more attention should be devoted to this problem. We then study one instance of this problem in depth. Specifically, we focus on the problem of manually verifying the values of a target attribute, and shows that the current best solution in industry, which uses crowdsourcing, has significant limitations. We develop a new solution that addresses the above limitations. Our solution can find a much more accurate ranking of the data values in terms of their difficulties for crowdsourcing, can help domain experts debug this ranking, and can handle ambiguous values for which no golden answers exist. Importantly, our solution provides a unified framework that allows users to easily express and solve a broad range of optimization problems for crowdsourcing, to balance between cost and accuracy. Finally, we describe extensive experiments with three real-world data sets that demonstrate the utility and promise of our solution approach.

Original languageEnglish
Title of host publicationAdvances in Database Technology - EDBT 2020
Subtitle of host publication23rd International Conference on Extending Database Technology, Proceedings
EditorsAngela Bonifati, Yongluan Zhou, Marcos Antonio Vaz Salles, Alexander Bohm, Dan Olteanu, George Fletcher, Arijit Khan, Bin Yang
PublisherOpenProceedings.org
Pages311-322
Number of pages12
ISBN (Electronic)9783893180837
DOIs
Publication statusPublished - 2020
Externally publishedYes
Event23rd International Conference on Extending Database Technology, EDBT 2020 - Copenhagen, Denmark
Duration: 30 Mar 20202 Apr 2020

Publication series

NameAdvances in Database Technology - EDBT
Volume2020-March
ISSN (Electronic)2367-2005

Conference

Conference23rd International Conference on Extending Database Technology, EDBT 2020
Country/TerritoryDenmark
CityCopenhagen
Period30/03/202/04/20

Fingerprint

Dive into the research topics of 'Manually detecting errors for data cleaning using adaptive crowdsourcing strategies'. Together they form a unique fingerprint.

Cite this