IDE: A System for Iterative Mislabel Detection

Yuhao Deng, Deng Qiyan, Chengliang Chai^*, Lei Cao, Nan Tang, Ju Fan, Jiayi Wang, Ye Yuan, Guoren Wang

^*Corresponding author for this work

doi:10.1145/3626246.3654737

School of Computer Science and Technology

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

2 Citations (Scopus)

Abstract

While machine learning techniques, especially deep neural networks, have shown remarkable success in various applications, their performance is adversely affected by label errors in training data. Acquiring high-quality annotated data is both costly and time-consuming in real-world scenarios, requiring extensive human annotation and verification. Consequently, many industry-applied models are trained over data containing substantial noise, significantly degrading the performance of these models. To address this critical issue, we demonstrate IDE, a novel system that iteratively detects mislabeled instances and repairs the wrong labels. Specifically, IDE leverages the early loss observation and influence-based verification to iteratively identify mislabeled instances. When the mislabeled instances are obtained in each iteration, IDE will repair their labels to enhance detection accuracy for subsequent iterations. The framework automatically determines the termination point when the early loss is no longer effective. For uncertain instances, it generates pseudo labels to train a binary classification model, leveraging the model's generalization ability to make the final decision. With a real-life scenario, we demonstrate that IDE produces high-quality training data by effective mislabel detection and repair.
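
The abstract describes the detect-and-repair loop only at a high level. As a rough illustration, and not the authors' implementation, the following Python sketch shows one way such a loop could look on a toy binary task: train a model for only a few epochs, flag instances whose early loss is high, flip their labels, and stop once nothing is flagged. The function names, the 1-D logistic regression model, the `log(2)` loss threshold, the label-flip repair rule, and the stopping rule are all assumptions made for this example. The influence-based verification and the pseudo-label classifier mentioned in the abstract are not modeled here.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_briefly(xs, ys, epochs=20, lr=0.1):
    """Fit a 1-D logistic regression with a few full-batch gradient steps.

    Stopping after a small number of epochs stands in for "early" training,
    before the model has had a chance to memorize noisy labels.
    """
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = sum((sigmoid(w * x + b) - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(sigmoid(w * x + b) - y for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def per_sample_loss(w, b, x, y):
    """Binary cross-entropy of a single instance."""
    p = sigmoid(w * x + b)
    return -math.log(p) if y == 1 else -math.log(1.0 - p)


def iterative_detect_and_repair(xs, noisy_ys, threshold=math.log(2), max_rounds=10):
    """Hypothetical loop: flag high-early-loss instances, flip them, repeat.

    The threshold log(2) flags an instance when the briefly trained model
    assigns its given label a probability below 0.5 (an assumption).
    """
    ys = list(noisy_ys)
    detected = set()
    for _ in range(max_rounds):
        w, b = train_briefly(xs, ys)
        flagged = [
            i for i, (x, y) in enumerate(zip(xs, ys))
            if per_sample_loss(w, b, x, y) > threshold
        ]
        if not flagged:
            break  # early loss no longer separates anything: stop
        for i in flagged:
            ys[i] = 1 - ys[i]  # binary repair: flip the suspected label
            detected.add(i)
    return sorted(detected), ys


# Toy data: negative x should be class 0, positive x class 1.
xs = [-3.0, -2.5, -2.0, -1.5, -1.0, 1.0, 1.5, 2.0, 2.5, 3.0]
noisy_ys = [0, 1, 0, 0, 0, 1, 1, 1, 0, 1]  # indices 1 and 8 are mislabeled

detected, repaired = iterative_detect_and_repair(xs, noisy_ys)
print(detected, repaired)
```

Flipping a label is only a valid repair in the binary case. A multi-class variant would need some other rule, for example taking the model's most probable class, which is again an assumption rather than something the abstract specifies.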

Original language	English
Title of host publication	SIGMOD-Companion 2024 - Companion of the 2024 International Conference on Management of Data
Publisher	Association for Computing Machinery
Pages	500-503
Number of pages	4
ISBN (Electronic)	9798400704222
DOIs	https://doi.org/10.1145/3626246.3654737
Publication status	Published - 9 Jun 2024
Event	2024 International Conference on Management of Data, SIGMOD 2024 - Santiago, Chile; Duration: 9 Jun 2024 → 15 Jun 2024

Publication series

Name	Proceedings of the ACM SIGMOD International Conference on Management of Data
ISSN (Print)	0730-8078

Conference

Conference	2024 International Conference on Management of Data, SIGMOD 2024
Country/Territory	Chile
City	Santiago
Period	9/06/24 → 15/06/24

Keywords

data cleaning
influence function
mislabel detection

Access to Document

10.1145/3626246.3654737

Cite this

Deng, Y., Qiyan, D., Chai, C., Cao, L., Tang, N., Fan, J., Wang, J., Yuan, Y., & Wang, G. (2024). IDE: A System for Iterative Mislabel Detection. In SIGMOD-Companion 2024 - Companion of the 2024 International Conference on Management of Data (pp. 500-503). (Proceedings of the ACM SIGMOD International Conference on Management of Data). Association for Computing Machinery. https://doi.org/10.1145/3626246.3654737

@inproceedings{fc17913a23504d3e9fb09d919b9708d9,

title = "IDE: A System for Iterative Mislabel Detection",

abstract = "While machine learning techniques, especially deep neural networks, have shown remarkable success in various applications, their performance is adversely affected by label errors in training data. Acquiring high-quality annotated data is both costly and time-consuming in real-world scenarios, requiring extensive human annotation and verification. Consequently, many industry-applied models are trained over data containing substantial noise, significantly degrading the performance of these models. To address this critical issue, we demonstrate IDE, a novel system that iteratively detects mislabeled instances and repairs the wrong labels. Specifically, IDE leverages the early loss observation and influence-based verification to iteratively identify mislabeled instances. When the mislabeled instances are obtained in each iteration, IDE will repair their labels to enhance detection accuracy for subsequent iterations. The framework automatically determines the termination point when the early loss is no longer effective. For uncertain instances, it generates pseudo labels to train a binary classification model, leveraging the model's generalization ability to make the final decision. With a real-life scenario, we demonstrate that IDE produces high-quality training data by effective mislabel detection and repair.",

keywords = "data cleaning, influence function, mislabel detection",

author = "Yuhao Deng and Deng Qiyan and Chengliang Chai and Lei Cao and Nan Tang and Ju Fan and Jiayi Wang and Ye Yuan and Guoren Wang",

note = "Publisher Copyright: {\textcopyright} 2024 ACM.; 2024 International Conference on Management of Data, SIGMOD 2024 ; Conference date: 09-06-2024 Through 15-06-2024",

year = "2024",

month = jun,

day = "9",

doi = "10.1145/3626246.3654737",

language = "English",

series = "Proceedings of the ACM SIGMOD International Conference on Management of Data",

publisher = "Association for Computing Machinery",

pages = "500--503",

booktitle = "SIGMOD-Companion 2024 - Companion of the 2024 International Conference on Management of Data",

}

Deng, Y, Qiyan, D, Chai, C, Cao, L, Tang, N, Fan, J, Wang, J, Yuan, Y & Wang, G 2024, IDE: A System for Iterative Mislabel Detection. in SIGMOD-Companion 2024 - Companion of the 2024 International Conference on Management of Data. Proceedings of the ACM SIGMOD International Conference on Management of Data, Association for Computing Machinery, pp. 500-503, 2024 International Conference on Management of Data, SIGMOD 2024, Santiago, Chile, 9/06/24. https://doi.org/10.1145/3626246.3654737

IDE: A System for Iterative Mislabel Detection. / Deng, Yuhao; Qiyan, Deng; Chai, Chengliang et al.
SIGMOD-Companion 2024 - Companion of the 2024 International Conference on Management of Data. Association for Computing Machinery, 2024. p. 500-503 (Proceedings of the ACM SIGMOD International Conference on Management of Data).

TY - GEN

T1 - IDE: A System for Iterative Mislabel Detection

T2 - 2024 International Conference on Management of Data, SIGMOD 2024

AU - Deng, Yuhao

AU - Qiyan, Deng

AU - Chai, Chengliang

AU - Cao, Lei

AU - Tang, Nan

AU - Fan, Ju

AU - Wang, Jiayi

AU - Yuan, Ye

AU - Wang, Guoren

PY - 2024/6/9

Y1 - 2024/6/9

AB - While machine learning techniques, especially deep neural networks, have shown remarkable success in various applications, their performance is adversely affected by label errors in training data. Acquiring high-quality annotated data is both costly and time-consuming in real-world scenarios, requiring extensive human annotation and verification. Consequently, many industry-applied models are trained over data containing substantial noise, significantly degrading the performance of these models. To address this critical issue, we demonstrate IDE, a novel system that iteratively detects mislabeled instances and repairs the wrong labels. Specifically, IDE leverages the early loss observation and influence-based verification to iteratively identify mislabeled instances. When the mislabeled instances are obtained in each iteration, IDE will repair their labels to enhance detection accuracy for subsequent iterations. The framework automatically determines the termination point when the early loss is no longer effective. For uncertain instances, it generates pseudo labels to train a binary classification model, leveraging the model's generalization ability to make the final decision. With a real-life scenario, we demonstrate that IDE produces high-quality training data by effective mislabel detection and repair.

KW - data cleaning

KW - influence function

KW - mislabel detection

UR - http://www.scopus.com/inward/record.url?scp=85195636198&partnerID=8YFLogxK

U2 - 10.1145/3626246.3654737

DO - 10.1145/3626246.3654737

M3 - Conference contribution

AN - SCOPUS:85195636198

T3 - Proceedings of the ACM SIGMOD International Conference on Management of Data

SP - 500

EP - 503

BT - SIGMOD-Companion 2024 - Companion of the 2024 International Conference on Management of Data

PB - Association for Computing Machinery

Y2 - 9 June 2024 through 15 June 2024

ER -
