TY - GEN
T1 - IDE
T2 - 2024 International Conference on Management of Data, SIGMOD 2024
AU - Deng, Yuhao
AU - Deng, Qiyan
AU - Chai, Chengliang
AU - Cao, Lei
AU - Tang, Nan
AU - Fan, Ju
AU - Wang, Jiayi
AU - Yuan, Ye
AU - Wang, Guoren
N1 - Publisher Copyright: © 2024 ACM.
PY - 2024/6/9
Y1 - 2024/6/9
N2 - While machine learning techniques, especially deep neural networks, have shown remarkable success in various applications, their performance is adversely affected by label errors in training data. Acquiring high-quality annotated data is both costly and time-consuming in real-world scenarios, requiring extensive human annotation and verification. Consequently, many industry-applied models are trained over data containing substantial noise, significantly degrading the performance of these models. To address this critical issue, we demonstrate IDE, a novel system that iteratively detects mislabeled instances and repairs the wrong labels. Specifically, IDE leverages the early loss observation and influence-based verification to iteratively identify mislabeled instances. When the mislabeled instances are obtained in each iteration, IDE will repair their labels to enhance detection accuracy for subsequent iterations. The framework automatically determines the termination point when the early loss is no longer effective. For uncertain instances, it generates pseudo labels to train a binary classification model, leveraging the model's generalization ability to make the final decision. With a real-life scenario, we demonstrate that IDE produces high-quality training data by effective mislabel detection and repair.
AB - While machine learning techniques, especially deep neural networks, have shown remarkable success in various applications, their performance is adversely affected by label errors in training data. Acquiring high-quality annotated data is both costly and time-consuming in real-world scenarios, requiring extensive human annotation and verification. Consequently, many industry-applied models are trained over data containing substantial noise, significantly degrading the performance of these models. To address this critical issue, we demonstrate IDE, a novel system that iteratively detects mislabeled instances and repairs the wrong labels. Specifically, IDE leverages the early loss observation and influence-based verification to iteratively identify mislabeled instances. When the mislabeled instances are obtained in each iteration, IDE will repair their labels to enhance detection accuracy for subsequent iterations. The framework automatically determines the termination point when the early loss is no longer effective. For uncertain instances, it generates pseudo labels to train a binary classification model, leveraging the model's generalization ability to make the final decision. With a real-life scenario, we demonstrate that IDE produces high-quality training data by effective mislabel detection and repair.
KW - data cleaning
KW - influence function
KW - mislabel detection
UR - http://www.scopus.com/inward/record.url?scp=85195636198&partnerID=8YFLogxK
U2 - 10.1145/3626246.3654737
DO - 10.1145/3626246.3654737
M3 - Conference contribution
AN - SCOPUS:85195636198
T3 - Proceedings of the ACM SIGMOD International Conference on Management of Data
SP - 500
EP - 503
BT - SIGMOD-Companion 2024 - Companion of the 2024 International Conference on Management of Data
PB - Association for Computing Machinery
Y2 - 9 June 2024 through 15 June 2024
ER -