Efficient Data Cleaning Framework for K-Nearest Neighbor Learning Models

Jingyi Wang; Yinjia Chen; Ye Yuan; Chen Chen; Guoren Wang

doi:10.3778/j.issn.1673-9418.2207105

Jingyi Wang, Yinjia Chen, Ye Yuan^*, Chen Chen, Guoren Wang

^*Corresponding author for this work

School of Computer Science and Technology

Research output: Contribution to journal › Article › peer-review

1 Citation (Scopus)

Abstract

Real-world datasets often contain missing data, and they must be cleaned before effective machine learning models can be built on them. Ensuring the quality of the cleaned datasets often requires human involvement, which incurs considerable cost. Prioritizing which incomplete data to clean helps minimize the scale of cleaning and saves labor costs. Calculating this priority requires determining how much the incomplete data contributes to the performance of the model. The Shapley value is a popular method for evaluating such contributions, so it can be used to calculate the cleaning priority. Because existing work does not address the Shapley value of incomplete data, a representation of the Shapley value of incomplete data is first defined based on the possible worlds of the dataset. An approximation algorithm that calculates the Shapley value of incomplete data for the K-nearest neighbor classification model in polynomial time is then proposed, based on the K-nearest neighbor utility. Finally, ShapClean, a heuristic data cleaning algorithm based on the Shapley value, is proposed. Experiments show that the algorithm often significantly exceeds existing automatic cleaning algorithms in accuracy. Compared with data cleaning algorithms that also require human involvement, ShapClean saves more labor costs while ensuring the desired model accuracy.
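
As a rough illustration of the contribution measure the abstract refers to, the sketch below computes exact Shapley values for a four-point toy training set under one commonly used KNN utility: the number of the K nearest neighbors whose label matches a single test point, divided by K. It is not ShapClean and not the paper's approximation algorithm, and it does not model incomplete data or possible worlds. The function names, the 1-D distance, and the toy data are all assumptions made for illustration.

```python
from itertools import permutations
from math import factorial


def knn_utility(subset, points, labels, x_test, y_test, k):
    """Assumed utility: matching labels among the k nearest points of `subset`, divided by k."""
    nearest = sorted(subset, key=lambda i: abs(points[i] - x_test))[:k]
    return sum(labels[i] == y_test for i in nearest) / k


def exact_shapley(points, labels, x_test, y_test, k):
    """Brute-force Shapley values: average marginal gain over all orderings.

    Exponential in the number of points, so only usable on toy inputs.
    """
    n = len(points)
    values = [0.0] * n
    for order in permutations(range(n)):
        prefix = []
        previous = 0.0  # utility of the empty set
        for i in order:
            prefix.append(i)
            current = knn_utility(prefix, points, labels, x_test, y_test, k)
            values[i] += current - previous
            previous = current
    return [v / factorial(n) for v in values]


# Hypothetical toy data: point 1 sits close to the test point but has the "wrong" label.
points = [0.0, 1.0, 2.0, 5.0]
labels = [1, 0, 1, 0]
values = exact_shapley(points, labels, x_test=0.2, y_test=1, k=2)

# Rank points from lowest to highest contribution; low values flag harmful points.
ranking = sorted(range(len(points)), key=lambda i: values[i])
print(values, ranking)
```

In this toy setting the values sum to the utility of the full training set, and the mislabeled nearby point receives a negative value. That is the kind of signal a priority ordering could be built on.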

Original language	English
Pages (from-to)	2241-2251
Number of pages	11
Journal	Journal of Frontiers of Computer Science and Technology
Volume	17
Issue number	9
DOIs	https://doi.org/10.3778/j.issn.1673-9418.2207105
Publication status	Published - 1 Sept 2023

Keywords

K-nearest neighbor (KNN)
Shapley value
cleaning priority
data cleaning
incomplete dataset

Access to Document

10.3778/j.issn.1673-9418.2207105

Cite this

@article{da37afb485224942baef6a22c68221cf,

title = "Efficient Data Cleaning Framework for K-Nearest Neighbor Learning Models",

abstract = "Real-world datasets often contain missing data, and they must be cleaned before effective machine learning models can be built on them. Ensuring the quality of the cleaned datasets often requires human involvement, which incurs considerable cost. Prioritizing which incomplete data to clean helps minimize the scale of cleaning and saves labor costs. Calculating this priority requires determining how much the incomplete data contributes to the performance of the model. The Shapley value is a popular method for evaluating such contributions, so it can be used to calculate the cleaning priority. Because existing work does not address the Shapley value of incomplete data, a representation of the Shapley value of incomplete data is first defined based on the possible worlds of the dataset. An approximation algorithm that calculates the Shapley value of incomplete data for the K-nearest neighbor classification model in polynomial time is then proposed, based on the K-nearest neighbor utility. Finally, ShapClean, a heuristic data cleaning algorithm based on the Shapley value, is proposed. Experiments show that the algorithm often significantly exceeds existing automatic cleaning algorithms in accuracy. Compared with data cleaning algorithms that also require human involvement, ShapClean saves more labor costs while ensuring the desired model accuracy.",

keywords = "K-nearest neighbor (KNN), Shapley value, cleaning priority, data cleaning, incomplete dataset",

author = "Jingyi Wang and Yinjia Chen and Ye Yuan and Chen Chen and Guoren Wang",

year = "2023",

month = sep,

day = "1",

doi = "10.3778/j.issn.1673-9418.2207105",

language = "English",

volume = "17",

pages = "2241--2251",

journal = "Journal of Frontiers of Computer Science and Technology",

issn = "1673-9418",

publisher = "Journal of Computer Engineering and Applications Beijing Co., Ltd.; Science Press",

number = "9",

}

TY - JOUR

T1 - Efficient Data Cleaning Framework for K-Nearest Neighbor Learning Models

AU - Wang, Jingyi

AU - Chen, Yinjia

AU - Yuan, Ye

AU - Chen, Chen

AU - Wang, Guoren

PY - 2023/9/1

Y1 - 2023/9/1

N2 - Real-world datasets often contain missing data, and they must be cleaned before effective machine learning models can be built on them. Ensuring the quality of the cleaned datasets often requires human involvement, which incurs considerable cost. Prioritizing which incomplete data to clean helps minimize the scale of cleaning and saves labor costs. Calculating this priority requires determining how much the incomplete data contributes to the performance of the model. The Shapley value is a popular method for evaluating such contributions, so it can be used to calculate the cleaning priority. Because existing work does not address the Shapley value of incomplete data, a representation of the Shapley value of incomplete data is first defined based on the possible worlds of the dataset. An approximation algorithm that calculates the Shapley value of incomplete data for the K-nearest neighbor classification model in polynomial time is then proposed, based on the K-nearest neighbor utility. Finally, ShapClean, a heuristic data cleaning algorithm based on the Shapley value, is proposed. Experiments show that the algorithm often significantly exceeds existing automatic cleaning algorithms in accuracy. Compared with data cleaning algorithms that also require human involvement, ShapClean saves more labor costs while ensuring the desired model accuracy.

AB - Real-world datasets often contain missing data, and they must be cleaned before effective machine learning models can be built on them. Ensuring the quality of the cleaned datasets often requires human involvement, which incurs considerable cost. Prioritizing which incomplete data to clean helps minimize the scale of cleaning and saves labor costs. Calculating this priority requires determining how much the incomplete data contributes to the performance of the model. The Shapley value is a popular method for evaluating such contributions, so it can be used to calculate the cleaning priority. Because existing work does not address the Shapley value of incomplete data, a representation of the Shapley value of incomplete data is first defined based on the possible worlds of the dataset. An approximation algorithm that calculates the Shapley value of incomplete data for the K-nearest neighbor classification model in polynomial time is then proposed, based on the K-nearest neighbor utility. Finally, ShapClean, a heuristic data cleaning algorithm based on the Shapley value, is proposed. Experiments show that the algorithm often significantly exceeds existing automatic cleaning algorithms in accuracy. Compared with data cleaning algorithms that also require human involvement, ShapClean saves more labor costs while ensuring the desired model accuracy.

KW - K-nearest neighbor (KNN)

KW - Shapley value

KW - cleaning priority

KW - data cleaning

KW - incomplete dataset

UR - http://www.scopus.com/inward/record.url?scp=85175054522&partnerID=8YFLogxK

U2 - 10.3778/j.issn.1673-9418.2207105

DO - 10.3778/j.issn.1673-9418.2207105

M3 - Article

AN - SCOPUS:85175054522

SN - 1673-9418

VL - 17

SP - 2241

EP - 2251

JO - Journal of Frontiers of Computer Science and Technology

JF - Journal of Frontiers of Computer Science and Technology

IS - 9

ER -
