Manually detecting errors for data cleaning using adaptive crowdsourcing strategies

Haojun Zhang; Chengliang Chai; An Hai Doan; Paraschos Koutris; Esteban Arcaute

doi:10.5441/002/edbt.2020.28

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

3 Citations (Scopus)

Abstract

Current work to detect data errors often uses (semi-)automatic solutions. In this paper, however, we argue that there are many real-world scenarios where users have to detect data errors completely manually, and that more attention should be devoted to this problem. We then study one instance of this problem in depth. Specifically, we focus on the problem of manually verifying the values of a target attribute, and show that the current best solution in industry, which uses crowdsourcing, has significant limitations. We develop a new solution that addresses the above limitations. Our solution can find a much more accurate ranking of the data values in terms of their difficulties for crowdsourcing, can help domain experts debug this ranking, and can handle ambiguous values for which no golden answers exist. Importantly, our solution provides a unified framework that allows users to easily express and solve a broad range of optimization problems for crowdsourcing, to balance between cost and accuracy. Finally, we describe extensive experiments with three real-world data sets that demonstrate the utility and promise of our solution approach.

Original language	English
Title of host publication	Advances in Database Technology - EDBT 2020
Subtitle of host publication	23rd International Conference on Extending Database Technology, Proceedings
Editors	Angela Bonifati, Yongluan Zhou, Marcos Antonio Vaz Salles, Alexander Bohm, Dan Olteanu, George Fletcher, Arijit Khan, Bin Yang
Publisher	OpenProceedings.org
Pages	311-322
Number of pages	12
ISBN (Electronic)	9783893180837
DOIs	https://doi.org/10.5441/002/edbt.2020.28
Publication status	Published - 2020
Externally published	Yes
Event	23rd International Conference on Extending Database Technology, EDBT 2020 - Copenhagen, Denmark
Duration	30 Mar 2020 → 2 Apr 2020

Publication series

Name	Advances in Database Technology - EDBT
Volume	2020-March
ISSN (Electronic)	2367-2005

Conference

Conference	23rd International Conference on Extending Database Technology, EDBT 2020
Country/Territory	Denmark
City	Copenhagen
Period	30/03/20 → 2/04/20

Access to Document

10.5441/002/edbt.2020.28

Cite this

Zhang, H., Chai, C., Doan, A. H., Koutris, P., & Arcaute, E. (2020). Manually detecting errors for data cleaning using adaptive crowdsourcing strategies. In A. Bonifati, Y. Zhou, M. A. Vaz Salles, A. Bohm, D. Olteanu, G. Fletcher, A. Khan, & B. Yang (Eds.), Advances in Database Technology - EDBT 2020: 23rd International Conference on Extending Database Technology, Proceedings (pp. 311-322). (Advances in Database Technology - EDBT; Vol. 2020-March). OpenProceedings.org. https://doi.org/10.5441/002/edbt.2020.28

Zhang, Haojun ; Chai, Chengliang ; Doan, An Hai et al. / Manually detecting errors for data cleaning using adaptive crowdsourcing strategies. Advances in Database Technology - EDBT 2020: 23rd International Conference on Extending Database Technology, Proceedings. editor / Angela Bonifati ; Yongluan Zhou ; Marcos Antonio Vaz Salles ; Alexander Bohm ; Dan Olteanu ; George Fletcher ; Arijit Khan ; Bin Yang. OpenProceedings.org, 2020. pp. 311-322 (Advances in Database Technology - EDBT).

@inproceedings{02d419ffa6354d07a03a5a44627ed949,

title = "Manually detecting errors for data cleaning using adaptive crowdsourcing strategies",

abstract = "Current work to detect data errors often uses (semi-)automatic solutions. In this paper, however, we argue that there are many real-world scenarios where users have to detect data errors completely manually, and that more attention should be devoted to this problem. We then study one instance of this problem in depth. Specifically, we focus on the problem of manually verifying the values of a target attribute, and show that the current best solution in industry, which uses crowdsourcing, has significant limitations. We develop a new solution that addresses the above limitations. Our solution can find a much more accurate ranking of the data values in terms of their difficulties for crowdsourcing, can help domain experts debug this ranking, and can handle ambiguous values for which no golden answers exist. Importantly, our solution provides a unified framework that allows users to easily express and solve a broad range of optimization problems for crowdsourcing, to balance between cost and accuracy. Finally, we describe extensive experiments with three real-world data sets that demonstrate the utility and promise of our solution approach.",

author = "Haojun Zhang and Chengliang Chai and Doan, {An Hai} and Paraschos Koutris and Esteban Arcaute",

note = "Publisher Copyright: {\textcopyright} 2020 Copyright held by the owner/author(s).; 23rd International Conference on Extending Database Technology, EDBT 2020 ; Conference date: 30-03-2020 Through 02-04-2020",

year = "2020",

doi = "10.5441/002/edbt.2020.28",

language = "English",

series = "Advances in Database Technology - EDBT",

publisher = "OpenProceedings.org",

pages = "311--322",

editor = "Angela Bonifati and Yongluan Zhou and {Vaz Salles}, {Marcos Antonio} and Alexander Bohm and Dan Olteanu and George Fletcher and Arijit Khan and Bin Yang",

booktitle = "Advances in Database Technology - EDBT 2020",

}

Zhang, H, Chai, C, Doan, AH, Koutris, P & Arcaute, E 2020, Manually detecting errors for data cleaning using adaptive crowdsourcing strategies. in A Bonifati, Y Zhou, MA Vaz Salles, A Bohm, D Olteanu, G Fletcher, A Khan & B Yang (eds), Advances in Database Technology - EDBT 2020: 23rd International Conference on Extending Database Technology, Proceedings. Advances in Database Technology - EDBT, vol. 2020-March, OpenProceedings.org, pp. 311-322, 23rd International Conference on Extending Database Technology, EDBT 2020, Copenhagen, Denmark, 30/03/20. https://doi.org/10.5441/002/edbt.2020.28

Manually detecting errors for data cleaning using adaptive crowdsourcing strategies. / Zhang, Haojun; Chai, Chengliang; Doan, An Hai et al.
Advances in Database Technology - EDBT 2020: 23rd International Conference on Extending Database Technology, Proceedings. ed. / Angela Bonifati; Yongluan Zhou; Marcos Antonio Vaz Salles; Alexander Bohm; Dan Olteanu; George Fletcher; Arijit Khan; Bin Yang. OpenProceedings.org, 2020. p. 311-322 (Advances in Database Technology - EDBT; Vol. 2020-March).

TY - GEN

T1 - Manually detecting errors for data cleaning using adaptive crowdsourcing strategies

AU - Zhang, Haojun

AU - Chai, Chengliang

AU - Doan, An Hai

AU - Koutris, Paraschos

AU - Arcaute, Esteban

PY - 2020

Y1 - 2020

N2 - Current work to detect data errors often uses (semi-)automatic solutions. In this paper, however, we argue that there are many real-world scenarios where users have to detect data errors completely manually, and that more attention should be devoted to this problem. We then study one instance of this problem in depth. Specifically, we focus on the problem of manually verifying the values of a target attribute, and show that the current best solution in industry, which uses crowdsourcing, has significant limitations. We develop a new solution that addresses the above limitations. Our solution can find a much more accurate ranking of the data values in terms of their difficulties for crowdsourcing, can help domain experts debug this ranking, and can handle ambiguous values for which no golden answers exist. Importantly, our solution provides a unified framework that allows users to easily express and solve a broad range of optimization problems for crowdsourcing, to balance between cost and accuracy. Finally, we describe extensive experiments with three real-world data sets that demonstrate the utility and promise of our solution approach.

AB - Current work to detect data errors often uses (semi-)automatic solutions. In this paper, however, we argue that there are many real-world scenarios where users have to detect data errors completely manually, and that more attention should be devoted to this problem. We then study one instance of this problem in depth. Specifically, we focus on the problem of manually verifying the values of a target attribute, and show that the current best solution in industry, which uses crowdsourcing, has significant limitations. We develop a new solution that addresses the above limitations. Our solution can find a much more accurate ranking of the data values in terms of their difficulties for crowdsourcing, can help domain experts debug this ranking, and can handle ambiguous values for which no golden answers exist. Importantly, our solution provides a unified framework that allows users to easily express and solve a broad range of optimization problems for crowdsourcing, to balance between cost and accuracy. Finally, we describe extensive experiments with three real-world data sets that demonstrate the utility and promise of our solution approach.

UR - http://www.scopus.com/inward/record.url?scp=85084190062&partnerID=8YFLogxK

U2 - 10.5441/002/edbt.2020.28

DO - 10.5441/002/edbt.2020.28

M3 - Conference contribution

AN - SCOPUS:85084190062

T3 - Advances in Database Technology - EDBT

SP - 311

EP - 322

BT - Advances in Database Technology - EDBT 2020

A2 - Bonifati, Angela

A2 - Zhou, Yongluan

A2 - Vaz Salles, Marcos Antonio

A2 - Bohm, Alexander

A2 - Olteanu, Dan

A2 - Fletcher, George

A2 - Khan, Arijit

A2 - Yang, Bin

PB - OpenProceedings.org

T2 - 23rd International Conference on Extending Database Technology, EDBT 2020

Y2 - 30 March 2020 through 2 April 2020

ER -

Zhang H, Chai C, Doan AH, Koutris P, Arcaute E. Manually detecting errors for data cleaning using adaptive crowdsourcing strategies. In Bonifati A, Zhou Y, Vaz Salles MA, Bohm A, Olteanu D, Fletcher G, Khan A, Yang B, editors, Advances in Database Technology - EDBT 2020: 23rd International Conference on Extending Database Technology, Proceedings. OpenProceedings.org. 2020. p. 311-322. (Advances in Database Technology - EDBT). doi: 10.5441/002/edbt.2020.28
