IDE: A System for Iterative Mislabel Detection

Yuhao Deng, Deng Qiyan, Chengliang Chai*, Lei Cao, Nan Tang, Ju Fan, Jiayi Wang, Ye Yuan, Guoren Wang

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

2 Citations (Scopus)

Abstract

While machine learning techniques, especially deep neural networks, have shown remarkable success in various applications, their performance is adversely affected by label errors in training data. Acquiring high-quality annotated data is both costly and time-consuming in real-world scenarios, requiring extensive human annotation and verification. Consequently, many models deployed in industry are trained on data containing substantial noise, which significantly degrades their performance. To address this critical issue, we demonstrate IDE, a novel system that iteratively detects mislabeled instances and repairs the wrong labels. Specifically, IDE leverages early loss observation and influence-based verification to iteratively identify mislabeled instances. Once the mislabeled instances are identified in an iteration, IDE repairs their labels to improve detection accuracy in subsequent iterations. The framework automatically determines the termination point, stopping when the early loss is no longer effective. For uncertain instances, it generates pseudo labels to train a binary classification model, leveraging the model's generalization ability to make the final decision. Using a real-life scenario, we demonstrate that IDE produces high-quality training data through effective mislabel detection and repair.
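The detect-and-repair loop described in the abstract can be illustrated with a toy sketch. This is not the IDE implementation: it assumes a 1-D logistic regression trained for only a few epochs as a stand-in for "early loss observation", omits influence-based verification and the pseudo-label classifier entirely, and simply stops when no instance is flagged. All function names, the loss threshold, and the data are hypothetical.

```python
import math

def train_early(xs, ys, epochs=5, lr=0.5):
    """Fit a 1-D logistic regression for only a few full-batch epochs."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        gw = gb = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))
            gw += (p - y) * x
            gb += p - y
        w -= lr * gw / n
        b -= lr * gb / n
    return w, b

def sample_loss(w, b, x, y):
    """Cross-entropy loss of one instance under the early model."""
    p = 1.0 / (1.0 + math.exp(-(w * x + b)))
    p = min(max(p, 1e-12), 1.0 - 1e-12)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

def iterative_repair(xs, noisy_labels, loss_threshold=math.log(2), max_rounds=10):
    """Repeatedly flag high-loss instances and flip their binary labels.

    Assumption: a loss above log(2) means the early model disagrees with
    the given label; the real system verifies candidates before repairing.
    """
    labels = list(noisy_labels)
    history = []  # indices repaired in each round
    for _ in range(max_rounds):
        w, b = train_early(xs, labels)
        flagged = [i for i, (x, y) in enumerate(zip(xs, labels))
                   if sample_loss(w, b, x, y) > loss_threshold]
        if not flagged:  # simplified termination: nothing left to repair
            break
        for i in flagged:
            labels[i] = 1 - labels[i]
        history.append(flagged)
    return labels, history

# Toy data: negative x is class 0, positive x is class 1; two labels are flipped.
xs = [-3.0, -2.5, -2.0, -1.5, -1.0, 1.0, 1.5, 2.0, 2.5, 3.0]
noisy = [0, 1, 0, 0, 0, 1, 1, 1, 0, 1]
repaired, history = iterative_repair(xs, noisy)
print(repaired, history)
```

In this toy setting the repaired labels feed the next round of training, which mirrors the abstract's point that repairing labels in one iteration improves detection in the next.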

Original language: English
Title of host publication: SIGMOD-Companion 2024 - Companion of the 2024 International Conference on Management of Data
Publisher: Association for Computing Machinery
Pages: 500-503
Number of pages: 4
ISBN (Electronic): 9798400704222
DOIs: https://doi.org/10.1145/3626246.3654737
Publication status: Published - 9 Jun 2024
Event: 2024 International Conference on Management of Data, SIGMOD 2024 - Santiago, Chile
Duration: 9 Jun 2024 to 15 Jun 2024

Publication series

Name: Proceedings of the ACM SIGMOD International Conference on Management of Data
ISSN (Print): 0730-8078

Conference

Conference: 2024 International Conference on Management of Data, SIGMOD 2024
Country/Territory: Chile
City: Santiago
Period: 9/06/24 to 15/06/24

Keywords

  • data cleaning
  • influence function
  • mislabel detection

Cite this

Deng, Y., Qiyan, D., Chai, C., Cao, L., Tang, N., Fan, J., Wang, J., Yuan, Y., & Wang, G. (2024). IDE: A System for Iterative Mislabel Detection. In SIGMOD-Companion 2024 - Companion of the 2024 International Conference on Management of Data (pp. 500-503). (Proceedings of the ACM SIGMOD International Conference on Management of Data). Association for Computing Machinery. https://doi.org/10.1145/3626246.3654737