Efficient Data Cleaning Framework for K-Nearest Neighbor Learning Models

Jingyi Wang; Yinjia Chen; Ye Yuan; Chen Chen; Guoren Wang

doi:10.3778/j.issn.1673-9418.2207105

Jingyi Wang, Yinjia Chen, Ye Yuan^*, Chen Chen, Guoren Wang

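^*Corresponding author for this work

School of Computer Science

Research output: Contribution to journal › Article › peer-review

1 Citation (Scopus)

Abstract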

Real-world datasets often contain missing values, and they must be cleaned before effective machine learning models can be built on them. Ensuring the quality of the cleaned data often requires human involvement, which incurs considerable cost. Prioritizing which incomplete data to clean helps minimize the scale of cleaning and saves labor costs. Computing this priority requires determining how much the incomplete data contributes to the performance of the model. The Shapley value is a popular measure of such contribution, so it can be used to compute the cleaning priority. Because existing work does not address the Shapley value of incomplete data, a representation of it is first defined based on the possible worlds of the dataset. An approximation algorithm that computes the Shapley value of incomplete data for the K-nearest neighbor classification model in polynomial time is then proposed, based on the K-nearest neighbor utility. Finally, ShapClean, a heuristic data cleaning algorithm based on the Shapley value, is proposed. Experiments show that the algorithm often significantly outperforms existing automatic cleaning algorithms in accuracy. Compared with data cleaning algorithms that also require human involvement, ShapClean saves more labor costs while achieving the desired model accuracy.
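To make the ideas in the abstract concrete, the following is a minimal Python sketch and not the paper's algorithm. It assumes the standard unweighted KNN utility (the fraction of the K nearest training points whose label matches a single test point) and uses the known exact recursion for Shapley values under that utility on complete data. The possible-worlds step then simply enumerates every completion of the missing cells and averages with equal weight, which is exponential in the number of missing cells; the paper instead proposes a polynomial-time approximation. The function names, the `candidates` mapping, and the uniform weighting are all illustrative assumptions.

```python
from itertools import product


def knn_shapley(train_x, train_y, test_x, test_y, k):
    """Exact Shapley values for an unweighted KNN classifier and one test point.

    Assumed utility of a subset S: (1/k) * number of the min(k, |S|)
    nearest points in S whose label equals test_y.
    """
    n = len(train_x)

    def dist(i):
        # Squared Euclidean distance to the test point.
        return sum((a - b) ** 2 for a, b in zip(train_x[i], test_x))

    # Nearest first; ties are broken by original index.
    order = sorted(range(n), key=lambda i: (dist(i), i))
    match = [1.0 if train_y[i] == test_y else 0.0 for i in order]

    values = [0.0] * n
    # Farthest point: it only matters when fewer than k points precede it.
    s = match[-1] * min(k, n) / (k * n)
    values[order[-1]] = s
    # Walk from the second-farthest point (rank n-1) to the nearest (rank 1).
    for rank in range(n - 1, 0, -1):
        s += (match[rank - 1] - match[rank]) / k * min(k, rank) / rank
        values[order[rank - 1]] = s
    return values


def expected_shapley(train_x, train_y, test_x, test_y, k, candidates):
    """Average Shapley values over all possible worlds, each equally likely.

    `candidates` (hypothetical format) maps the (row, column) of each
    missing cell to the list of values it may take.
    """
    cells = sorted(candidates)
    totals = [0.0] * len(train_x)
    worlds = 0
    for choice in product(*(candidates[c] for c in cells)):
        # Build one possible world by filling in every missing cell.
        world = [list(row) for row in train_x]
        for (r, c), v in zip(cells, choice):
            world[r][c] = v
        shapley = knn_shapley(world, train_y, test_x, test_y, k)
        for i, s in enumerate(shapley):
            totals[i] += s
        worlds += 1
    return [t / worlds for t in totals]


# Toy data: row 2 has a missing feature with two candidate repairs.
train_x = [[0.0], [1.0], [None], [5.0]]
train_y = ["a", "a", "b", "b"]
candidates = {(2, 0): [0.5, 4.0]}
print(expected_shapley(train_x, train_y, [0.2], "a", 2, candidates))
```

In this toy setup the incomplete row's value depends on which repair is chosen, which is the kind of signal a cleaning priority could draw on; how ShapClean actually turns such values into a priority is not specified in the abstract.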

Original language	English
Pages (from-to)	2241-2251
Number of pages	11
Journal	Journal of Frontiers of Computer Science and Technology
Volume	17
Issue number	9
DOI	https://doi.org/10.3778/j.issn.1673-9418.2207105
Publication status	Published - 1 Sep 2023


Cite this

@article{da37afb485224942baef6a22c68221cf,

title = "Efficient Data Cleaning Framework for K-Nearest Neighbor Learning Models",

abstract = "Real-world datasets often contain missing values, and they must be cleaned before effective machine learning models can be built on them. Ensuring the quality of the cleaned data often requires human involvement, which incurs considerable cost. Prioritizing which incomplete data to clean helps minimize the scale of cleaning and saves labor costs. Computing this priority requires determining how much the incomplete data contributes to the performance of the model. The Shapley value is a popular measure of such contribution, so it can be used to compute the cleaning priority. Because existing work does not address the Shapley value of incomplete data, a representation of it is first defined based on the possible worlds of the dataset. An approximation algorithm that computes the Shapley value of incomplete data for the K-nearest neighbor classification model in polynomial time is then proposed, based on the K-nearest neighbor utility. Finally, ShapClean, a heuristic data cleaning algorithm based on the Shapley value, is proposed. Experiments show that the algorithm often significantly outperforms existing automatic cleaning algorithms in accuracy. Compared with data cleaning algorithms that also require human involvement, ShapClean saves more labor costs while achieving the desired model accuracy.",

keywords = "K-nearest neighbor (KNN), Shapley value, cleaning priority, data cleaning, incomplete dataset",

author = "Jingyi Wang and Yinjia Chen and Ye Yuan and Chen Chen and Guoren Wang",

year = "2023",

month = sep,

day = "1",

doi = "10.3778/j.issn.1673-9418.2207105",

language = "English",

volume = "17",

pages = "2241--2251",

journal = "Journal of Frontiers of Computer Science and Technology",

issn = "1673-9418",

publisher = "Journal of Computer Engineering and Applications Beijing Co., Ltd.; Science Press",

number = "9",

}

TY - JOUR

T1 - Efficient Data Cleaning Framework for K-Nearest Neighbor Learning Models

AU - Wang, Jingyi

AU - Chen, Yinjia

AU - Yuan, Ye

AU - Chen, Chen

AU - Wang, Guoren

PY - 2023/9/1

Y1 - 2023/9/1

N2 - Real-world datasets often contain missing values, and they must be cleaned before effective machine learning models can be built on them. Ensuring the quality of the cleaned data often requires human involvement, which incurs considerable cost. Prioritizing which incomplete data to clean helps minimize the scale of cleaning and saves labor costs. Computing this priority requires determining how much the incomplete data contributes to the performance of the model. The Shapley value is a popular measure of such contribution, so it can be used to compute the cleaning priority. Because existing work does not address the Shapley value of incomplete data, a representation of it is first defined based on the possible worlds of the dataset. An approximation algorithm that computes the Shapley value of incomplete data for the K-nearest neighbor classification model in polynomial time is then proposed, based on the K-nearest neighbor utility. Finally, ShapClean, a heuristic data cleaning algorithm based on the Shapley value, is proposed. Experiments show that the algorithm often significantly outperforms existing automatic cleaning algorithms in accuracy. Compared with data cleaning algorithms that also require human involvement, ShapClean saves more labor costs while achieving the desired model accuracy.

AB - Real-world datasets often contain missing values, and they must be cleaned before effective machine learning models can be built on them. Ensuring the quality of the cleaned data often requires human involvement, which incurs considerable cost. Prioritizing which incomplete data to clean helps minimize the scale of cleaning and saves labor costs. Computing this priority requires determining how much the incomplete data contributes to the performance of the model. The Shapley value is a popular measure of such contribution, so it can be used to compute the cleaning priority. Because existing work does not address the Shapley value of incomplete data, a representation of it is first defined based on the possible worlds of the dataset. An approximation algorithm that computes the Shapley value of incomplete data for the K-nearest neighbor classification model in polynomial time is then proposed, based on the K-nearest neighbor utility. Finally, ShapClean, a heuristic data cleaning algorithm based on the Shapley value, is proposed. Experiments show that the algorithm often significantly outperforms existing automatic cleaning algorithms in accuracy. Compared with data cleaning algorithms that also require human involvement, ShapClean saves more labor costs while achieving the desired model accuracy.

KW - K-nearest neighbor (KNN)

KW - Shapley value

KW - cleaning priority

KW - data cleaning

KW - incomplete dataset

UR - http://www.scopus.com/inward/record.url?scp=85175054522&partnerID=8YFLogxK

U2 - 10.3778/j.issn.1673-9418.2207105

DO - 10.3778/j.issn.1673-9418.2207105

M3 - Article

AN - SCOPUS:85175054522

SN - 1673-9418

VL - 17

SP - 2241

EP - 2251

JO - Journal of Frontiers of Computer Science and Technology

JF - Journal of Frontiers of Computer Science and Technology

IS - 9

ER -
