TY - JOUR
T1 - Mind the Boundary
T2 - 41st International Conference on Machine Learning, ICML 2024
AU - Yang, Shuo
AU - Cao, Zhe
AU - Guo, Sheng
AU - Zhang, Ruiheng
AU - Luo, Ping
AU - Zhang, Shengping
AU - Nie, Liqiang
N1 - Publisher Copyright:
Copyright 2024 by the author(s)
PY - 2024
Y1 - 2024
N2 - Existing paradigms for pushing the state of the art require exponentially more training data in many fields. Coreset selection seeks to mitigate this growing demand by identifying the most efficient subset of training data. In this paper, we delve into geometry-based coreset methods and establish a preliminary theoretical link between the geometry of the data distribution and a model's generalization capability. Leveraging these theoretical insights, we propose a novel coreset construction method that selects training samples to reconstruct the decision boundary of a deep neural network learned on the full dataset. Extensive experiments across various popular benchmarks demonstrate the superiority of our method over multiple competitors. For the first time, our method achieves a 50% data pruning rate on the ImageNet-1K dataset while sacrificing less than 1% in accuracy. Additionally, we showcase and analyze the remarkable cross-architecture transferability of the coresets derived from our approach.
AB - Existing paradigms for pushing the state of the art require exponentially more training data in many fields. Coreset selection seeks to mitigate this growing demand by identifying the most efficient subset of training data. In this paper, we delve into geometry-based coreset methods and establish a preliminary theoretical link between the geometry of the data distribution and a model's generalization capability. Leveraging these theoretical insights, we propose a novel coreset construction method that selects training samples to reconstruct the decision boundary of a deep neural network learned on the full dataset. Extensive experiments across various popular benchmarks demonstrate the superiority of our method over multiple competitors. For the first time, our method achieves a 50% data pruning rate on the ImageNet-1K dataset while sacrificing less than 1% in accuracy. Additionally, we showcase and analyze the remarkable cross-architecture transferability of the coresets derived from our approach.
UR - http://www.scopus.com/inward/record.url?scp=85203798427&partnerID=8YFLogxK
M3 - Conference article
AN - SCOPUS:85203798427
SN - 2640-3498
VL - 235
SP - 55948
EP - 55960
JO - Proceedings of Machine Learning Research
JF - Proceedings of Machine Learning Research
Y2 - 21 July 2024 through 27 July 2024
ER -