Efficient Data Cleaning Framework for K-Nearest Neighbor Learning Models

Jingyi Wang, Yinjia Chen, Ye Yuan*, Chen Chen, Guoren Wang

*Corresponding author for this work

Research output: Contribution to journalArticlepeer-review

1 Citation (Scopus)

Abstract

Real-world datasets are often collected with missing data, and in order to build effective machine learning models on incomplete datasets, the datasets need to be cleaned. To ensure the quality of the cleaned datasets, human involvement is often required, which incurs considerable costs. Prioritizing the cleaning of incomplete data will help minimize cleaning scale and save labor costs. Calculating the priority needs determining the contribution of the incomplete data to the performance of the model. Shapley value is a popular method for evaluating the contribution, so it can be used to calculate the cleaning priority. Due to the lack of existing work on Shapley value of incomplete data, a representation of Shapley value of incomplete data is firstly defined based on the possible worlds of the datasets. And an approximation algorithm for calculating Shapley value of incomplete data in the K-nearest neighbor classification model in polynomial time is proposed based on the K-nearest neighbor utility. Finally, the ShapClean, a heuristic data cleaning algorithm based on Shapley value, is proposed. Experiments show that the algorithm can often significantly exceed the existing automatic cleaning algorithms in terms of the accuracy. And compared with data cleaning algorithms that also require human involvement, the Shap Clean can save more labourcosts while ensuring the desired model accuracy.

Original languageEnglish
Pages (from-to)2241-2251
Number of pages11
JournalJournal of Frontiers of Computer Science and Technology
Volume17
Issue number9
DOIs
Publication statusPublished - 1 Sept 2023

Keywords

  • K-nearest neighbor (KNN)
  • Shapley value
  • cleaning priority
  • data cleaning
  • incomplete dataset

Fingerprint

Dive into the research topics of 'Efficient Data Cleaning Framework for K-Nearest Neighbor Learning Models'. Together they form a unique fingerprint.

Cite this