Efficient Data Cleaning Framework for K-Nearest Neighbor Learning Models

Jingyi Wang, Yinjia Chen, Ye Yuan*, Chen Chen, Guoren Wang

*此作品的通讯作者

科研成果: 期刊稿件文章同行评审

1 引用 (Scopus)

摘要

Real-world datasets are often collected with missing data, and in order to build effective machine learning models on incomplete datasets, the datasets need to be cleaned. To ensure the quality of the cleaned datasets, human involvement is often required, which incurs considerable costs. Prioritizing the cleaning of incomplete data will help minimize cleaning scale and save labor costs. Calculating the priority needs determining the contribution of the incomplete data to the performance of the model. Shapley value is a popular method for evaluating the contribution, so it can be used to calculate the cleaning priority. Due to the lack of existing work on Shapley value of incomplete data, a representation of Shapley value of incomplete data is firstly defined based on the possible worlds of the datasets. And an approximation algorithm for calculating Shapley value of incomplete data in the K-nearest neighbor classification model in polynomial time is proposed based on the K-nearest neighbor utility. Finally, the ShapClean, a heuristic data cleaning algorithm based on Shapley value, is proposed. Experiments show that the algorithm can often significantly exceed the existing automatic cleaning algorithms in terms of the accuracy. And compared with data cleaning algorithms that also require human involvement, the Shap Clean can save more labourcosts while ensuring the desired model accuracy.

源语言英语
页(从-至)2241-2251
页数11
期刊Journal of Frontiers of Computer Science and Technology
17
9
DOI
出版状态已出版 - 1 9月 2023

指纹

探究 'Efficient Data Cleaning Framework for K-Nearest Neighbor Learning Models' 的科研主题。它们共同构成独一无二的指纹。

引用此