Abstract
Real-world datasets are often collected with missing data, and in order to build effective machine learning models on incomplete datasets, the datasets need to be cleaned. To ensure the quality of the cleaned datasets, human involvement is often required, which incurs considerable costs. Prioritizing the cleaning of incomplete data will help minimize cleaning scale and save labor costs. Calculating the priority needs determining the contribution of the incomplete data to the performance of the model. Shapley value is a popular method for evaluating the contribution, so it can be used to calculate the cleaning priority. Due to the lack of existing work on Shapley value of incomplete data, a representation of Shapley value of incomplete data is firstly defined based on the possible worlds of the datasets. And an approximation algorithm for calculating Shapley value of incomplete data in the K-nearest neighbor classification model in polynomial time is proposed based on the K-nearest neighbor utility. Finally, the ShapClean, a heuristic data cleaning algorithm based on Shapley value, is proposed. Experiments show that the algorithm can often significantly exceed the existing automatic cleaning algorithms in terms of the accuracy. And compared with data cleaning algorithms that also require human involvement, the Shap Clean can save more labourcosts while ensuring the desired model accuracy.
Original language | English |
---|---|
Pages (from-to) | 2241-2251 |
Number of pages | 11 |
Journal | Journal of Frontiers of Computer Science and Technology |
Volume | 17 |
Issue number | 9 |
DOIs | |
Publication status | Published - 1 Sept 2023 |
Keywords
- K-nearest neighbor (KNN)
- Shapley value
- cleaning priority
- data cleaning
- incomplete dataset