MisDetect: Iterative Mislabel Detection using Early Loss

Yuhao Deng; Chengliang Chai; Lei Cao; Nan Tang; Jiayi Wang; Ju Fan; Ye Yuan; Guoren Wang

doi:10.14778/3648160.3648161

Corresponding author: Chengliang Chai

School of Computer Science and Technology

Research output: Contribution to journal › Conference article › peer-review

3 Citations (Scopus)

Abstract

Supervised machine learning (ML) models trained on data with mislabeled instances often produce inaccurate results due to label errors. Traditional methods of detecting mislabeled instances rely on data proximity, where an instance is considered mislabeled if its label is inconsistent with those of its neighbors. However, these methods often perform poorly, because an instance does not always share the same label with its neighbors. ML-based methods instead use trained models to differentiate between mislabeled and clean instances. However, these methods struggle to achieve high accuracy, since the models may have already overfitted to the mislabeled instances. In this paper, we propose a novel framework, MisDetect, that detects mislabeled instances during model training. MisDetect leverages the early loss observation to iteratively identify and remove mislabeled instances. In this process, influence-based verification is applied to improve detection accuracy. Moreover, MisDetect automatically determines when the early loss is no longer effective in detecting mislabels, at which point the iterative detection process terminates. Finally, for the training instances whose status MisDetect is still uncertain about, MisDetect automatically produces pseudo labels to learn a binary classification model and leverages the generalization ability of that model to determine whether they are mislabeled. Our experiments on 15 datasets show that MisDetect outperforms 10 baseline methods, demonstrating its effectiveness in detecting mislabeled instances.
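
The iterative early-loss idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes a 1-D logistic regression trained for only a few epochs, a fixed number of rounds, and a fixed number of instances flagged per round, and it omits the influence-based verification, the automatic termination test, and the final pseudo-label classifier. All names (`train_early`, `sample_loss`, `detect_mislabels`) and the toy data are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_early(xs, ys, epochs=5, lr=0.5):
    """Run only a few gradient-descent epochs of 1-D logistic regression."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(w * x + b) - y
            grad_w += err * x
            grad_b += err
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


def sample_loss(w, b, x, y):
    """Cross-entropy loss of one instance under the early model."""
    z = w * x + b
    signed = z if y == 1 else -z
    return math.log1p(math.exp(-signed))


def detect_mislabels(xs, ys, rounds=2, per_round=2):
    """Assumed simplification: each round, retrain briefly on the remaining
    instances, then flag and remove the per_round highest-loss instances."""
    active = list(range(len(xs)))
    flagged = []
    for _ in range(rounds):
        w, b = train_early([xs[i] for i in active], [ys[i] for i in active])
        ranked = sorted(active,
                        key=lambda i: sample_loss(w, b, xs[i], ys[i]),
                        reverse=True)
        worst = ranked[:per_round]
        flagged.extend(worst)
        active = [i for i in active if i not in worst]
    return sorted(flagged)


# Toy data: class 0 lies at x <= -1, class 1 at x >= 1; four labels are flipped.
xs = [-(1.0 + 0.1 * i) for i in range(20)] + [1.0 + 0.1 * i for i in range(20)]
ys = [0] * 20 + [1] * 20
for i in (3, 10, 25, 33):
    ys[i] = 1 - ys[i]

flagged = detect_mislabels(xs, ys)
print(flagged)
```

On this toy data the flipped instances keep a loss above that of every clean instance after the short training phase, so two rounds of two removals recover them. Real data would need the verification and stopping steps the abstract describes, since a fixed removal budget can discard clean instances.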

Original language: English
Pages (from-to): 1159-1172
Number of pages: 14
Journal: Proceedings of the VLDB Endowment
Volume: 17
Issue number: 6
DOI: https://doi.org/10.14778/3648160.3648161
Publication status: Published - 2024
Event: 50th International Conference on Very Large Data Bases, VLDB 2024, Guangzhou, China (24 Aug 2024 to 29 Aug 2024)

Cite this

Deng, Y., Chai, C., Cao, L., Tang, N., Wang, J., Fan, J., Yuan, Y., & Wang, G. (2024). MisDetect: Iterative Mislabel Detection using Early Loss. Proceedings of the VLDB Endowment, 17(6), 1159-1172. https://doi.org/10.14778/3648160.3648161

@article{59c2af72abce4c88a4d41784d0dc2803,

title = "MisDetect: Iterative Mislabel Detection using Early Loss",

author = "Yuhao Deng and Chengliang Chai and Lei Cao and Nan Tang and Jiayi Wang and Ju Fan and Ye Yuan and Guoren Wang",

note = "Publisher Copyright: {\textcopyright} 2024, VLDB Endowment. All rights reserved.; 50th International Conference on Very Large Data Bases, VLDB 2024 ; Conference date: 24-08-2024 Through 29-08-2024",

year = "2024",

doi = "10.14778/3648160.3648161",

language = "English",

volume = "17",

pages = "1159--1172",

journal = "Proceedings of the VLDB Endowment",

issn = "2150-8097",

publisher = "Very Large Data Base Endowment Inc.",

number = "6",

}

TY - JOUR

T1 - MisDetect: Iterative Mislabel Detection using Early Loss

T2 - 50th International Conference on Very Large Data Bases, VLDB 2024

AU - Deng, Yuhao

AU - Chai, Chengliang

AU - Cao, Lei

AU - Tang, Nan

AU - Wang, Jiayi

AU - Fan, Ju

AU - Yuan, Ye

AU - Wang, Guoren

PY - 2024

Y1 - 2024

UR - http://www.scopus.com/inward/record.url?scp=85190665640&partnerID=8YFLogxK

U2 - 10.14778/3648160.3648161

DO - 10.14778/3648160.3648161

M3 - Conference article

AN - SCOPUS:85190665640

SN - 2150-8097

VL - 17

SP - 1159

EP - 1172

JO - Proceedings of the VLDB Endowment

JF - Proceedings of the VLDB Endowment

IS - 6

Y2 - 24 August 2024 through 29 August 2024

ER -
