TY - GEN
T1 - Manually detecting errors for data cleaning using adaptive crowdsourcing strategies
AU - Zhang, Haojun
AU - Chai, Chengliang
AU - Doan, AnHai
AU - Koutris, Paraschos
AU - Arcaute, Esteban
N1 - Publisher Copyright:
© 2020 Copyright held by the owner/author(s).
PY - 2020
Y1 - 2020
N2 - Current work to detect data errors often uses (semi-)automatic solutions. In this paper, however, we argue that there are many real-world scenarios where users have to detect data errors completely manually, and that more attention should be devoted to this problem. We then study one instance of this problem in depth. Specifically, we focus on the problem of manually verifying the values of a target attribute, and show that the current best solution in industry, which uses crowdsourcing, has significant limitations. We develop a new solution that addresses the above limitations. Our solution can find a much more accurate ranking of the data values in terms of their difficulties for crowdsourcing, can help domain experts debug this ranking, and can handle ambiguous values for which no golden answers exist. Importantly, our solution provides a unified framework that allows users to easily express and solve a broad range of optimization problems for crowdsourcing, to balance between cost and accuracy. Finally, we describe extensive experiments with three real-world data sets that demonstrate the utility and promise of our solution approach.
AB - Current work to detect data errors often uses (semi-)automatic solutions. In this paper, however, we argue that there are many real-world scenarios where users have to detect data errors completely manually, and that more attention should be devoted to this problem. We then study one instance of this problem in depth. Specifically, we focus on the problem of manually verifying the values of a target attribute, and show that the current best solution in industry, which uses crowdsourcing, has significant limitations. We develop a new solution that addresses the above limitations. Our solution can find a much more accurate ranking of the data values in terms of their difficulties for crowdsourcing, can help domain experts debug this ranking, and can handle ambiguous values for which no golden answers exist. Importantly, our solution provides a unified framework that allows users to easily express and solve a broad range of optimization problems for crowdsourcing, to balance between cost and accuracy. Finally, we describe extensive experiments with three real-world data sets that demonstrate the utility and promise of our solution approach.
UR - http://www.scopus.com/inward/record.url?scp=85084190062&partnerID=8YFLogxK
U2 - 10.5441/002/edbt.2020.28
DO - 10.5441/002/edbt.2020.28
M3 - Conference contribution
AN - SCOPUS:85084190062
T3 - Advances in Database Technology - EDBT
SP - 311
EP - 322
BT - Advances in Database Technology - EDBT 2020
A2 - Bonifati, Angela
A2 - Zhou, Yongluan
A2 - Vaz Salles, Marcos Antonio
A2 - Böhm, Alexander
A2 - Olteanu, Dan
A2 - Fletcher, George
A2 - Khan, Arijit
A2 - Yang, Bin
PB - OpenProceedings.org
T2 - 23rd International Conference on Extending Database Technology, EDBT 2020
Y2 - 30 March 2020 through 2 April 2020
ER -