TY - JOUR
T1 - MisDetect
T2 - 50th International Conference on Very Large Data Bases, VLDB 2024
AU - Deng, Yuhao
AU - Chai, Chengliang
AU - Cao, Lei
AU - Tang, Nan
AU - Wang, Jiayi
AU - Fan, Ju
AU - Yuan, Ye
AU - Wang, Guoren
N1 - Publisher Copyright:
© 2024, VLDB Endowment. All rights reserved.
PY - 2024
Y1 - 2024
N2 - Supervised machine learning (ML) models trained on data with mislabeled instances often produce inaccurate results due to label errors. Traditional methods of detecting mislabeled instances rely on data proximity, where an instance is considered mislabeled if its label is inconsistent with those of its neighbors. However, this approach often performs poorly, because an instance does not always share the same label as its neighbors. ML-based methods instead utilize trained models to differentiate between mislabeled and clean instances. However, these methods struggle to achieve high accuracy, since the models may have already overfitted the mislabeled instances. In this paper, we propose a novel framework, MisDetect, that detects mislabeled instances during model training. MisDetect leverages the early loss observation to iteratively identify and remove mislabeled instances. In this process, influence-based verification is applied to enhance detection accuracy. Moreover, MisDetect automatically determines when the early loss is no longer effective in detecting mislabels, such that the iterative detection process should terminate. Finally, for the training instances whose labels MisDetect remains uncertain about, it automatically produces pseudo labels to learn a binary classification model and leverages the generalization ability of that model to determine their status. Our experiments on 15 datasets show that MisDetect outperforms 10 baseline methods, demonstrating its effectiveness in detecting mislabeled instances.
UR - http://www.scopus.com/inward/record.url?scp=85190665640&partnerID=8YFLogxK
U2 - 10.14778/3648160.3648161
DO - 10.14778/3648160.3648161
M3 - Conference article
AN - SCOPUS:85190665640
SN - 2150-8097
VL - 17
SP - 1159
EP - 1172
JO - Proceedings of the VLDB Endowment
JF - Proceedings of the VLDB Endowment
IS - 6
Y2 - 24 August 2024 through 29 August 2024
ER -