TY - JOUR
T1 - Mind the Boundary
T2 - 41st International Conference on Machine Learning, ICML 2024
AU - Yang, Shuo
AU - Cao, Zhe
AU - Guo, Sheng
AU - Zhang, Ruiheng
AU - Luo, Ping
AU - Zhang, Shengping
AU - Nie, Liqiang
N1 - Publisher Copyright:
Copyright 2024 by the author(s)
PY - 2024
Y1 - 2024
N2 - Existing paradigms for pushing the state of the art require exponentially more training data in many fields. Coreset selection seeks to mitigate this growing demand by identifying the most efficient subset of training data. In this paper, we delve into geometry-based coreset methods and establish a preliminary theoretical link between the geometry of the data distribution and a model's generalization capability. Leveraging these theoretical insights, we propose a novel coreset construction method that selects training samples to reconstruct the decision boundary of a deep neural network learned on the full dataset. Extensive experiments across various popular benchmarks demonstrate the superiority of our method over multiple competitors. For the first time, our method achieves a 50% data pruning rate on the ImageNet-1K dataset while sacrificing less than 1% in accuracy. Additionally, we showcase and analyze the remarkable cross-architecture transferability of the coresets derived from our approach.
AB - Existing paradigms for pushing the state of the art require exponentially more training data in many fields. Coreset selection seeks to mitigate this growing demand by identifying the most efficient subset of training data. In this paper, we delve into geometry-based coreset methods and establish a preliminary theoretical link between the geometry of the data distribution and a model's generalization capability. Leveraging these theoretical insights, we propose a novel coreset construction method that selects training samples to reconstruct the decision boundary of a deep neural network learned on the full dataset. Extensive experiments across various popular benchmarks demonstrate the superiority of our method over multiple competitors. For the first time, our method achieves a 50% data pruning rate on the ImageNet-1K dataset while sacrificing less than 1% in accuracy. Additionally, we showcase and analyze the remarkable cross-architecture transferability of the coresets derived from our approach.
UR - http://www.scopus.com/inward/record.url?scp=85203798427&partnerID=8YFLogxK
M3 - Conference article
AN - SCOPUS:85203798427
SN - 2640-3498
VL - 235
SP - 55948
EP - 55960
JO - Proceedings of Machine Learning Research
JF - Proceedings of Machine Learning Research
Y2 - 21 July 2024 through 27 July 2024
ER -