Mind the Boundary: Coreset Selection via Reconstructing the Decision Boundary

Shuo Yang, Zhe Cao, Sheng Guo, Ruiheng Zhang, Ping Luo, Shengping Zhang^*, Liqiang Nie

^*Corresponding author for this work

School of Mechatronical Engineering

Research output: Contribution to journal › Conference article › peer-review

Abstract

Existing paradigms for pushing the state of the art require exponentially more training data in many fields. Coreset selection seeks to mitigate this growing demand by identifying the most efficient subset of training data. In this paper, we examine geometry-based coreset methods and establish a preliminary theoretical link between the geometry of the data distribution and a model's generalization capability. Leveraging these theoretical insights, we propose a novel coreset construction method that selects training samples to reconstruct the decision boundary of a deep neural network learned on the full dataset. Extensive experiments across various popular benchmarks demonstrate the superiority of our method over multiple competitors. For the first time, our method achieves a 50% data pruning rate on the ImageNet-1K dataset while sacrificing less than 1% in accuracy. Additionally, we showcase and analyze the remarkable cross-architecture transferability of the coresets derived from our approach.

Original language	English
Pages (from-to)	55948-55960
Number of pages	13
Journal	Proceedings of Machine Learning Research
Volume	235
Publication status	Published - 2024
Event	41st International Conference on Machine Learning, ICML 2024 - Vienna, Austria
Duration	21 Jul 2024 → 27 Jul 2024

Cite this

@article{c46cf73e0a5f42bab951752f2f780eab,
  title = "Mind the Boundary: Coreset Selection via Reconstructing the Decision Boundary",
  abstract = "Existing paradigms of pushing the state of the art require exponentially more training data in many fields. Coreset selection seeks to mitigate this growing demand by identifying the most efficient subset of training data. In this paper, we delve into geometry-based coreset methods and preliminarily link the geometry of data distribution with models' generalization capability in theoretics. Leveraging these theoretical insights, we propose a novel coreset construction method by selecting training samples to reconstruct the decision boundary of a deep neural network learned on the full dataset. Extensive experiments across various popular benchmarks demonstrate the superiority of our method over multiple competitors. For the first time, our method achieves a 50% data pruning rate on the ImageNet-1K dataset while sacrificing less than 1% in accuracy. Additionally, we showcase and analyze the remarkable cross-architecture transferability of the coresets derived from our approach.",
  author = "Shuo Yang and Zhe Cao and Sheng Guo and Ruiheng Zhang and Ping Luo and Shengping Zhang and Liqiang Nie",
  year = "2024",
  language = "English",
  volume = "235",
  pages = "55948--55960",
  journal = "Proceedings of Machine Learning Research",
  issn = "2640-3498",
  publisher = "ML Research Press",
}

TY  - JOUR
T1  - Mind the Boundary
T2  - 41st International Conference on Machine Learning, ICML 2024
AU  - Yang, Shuo
AU  - Cao, Zhe
AU  - Guo, Sheng
AU  - Zhang, Ruiheng
AU  - Luo, Ping
AU  - Zhang, Shengping
AU  - Nie, Liqiang
PY  - 2024
Y1  - 2024
N2  - Existing paradigms of pushing the state of the art require exponentially more training data in many fields. Coreset selection seeks to mitigate this growing demand by identifying the most efficient subset of training data. In this paper, we delve into geometry-based coreset methods and preliminarily link the geometry of data distribution with models' generalization capability in theoretics. Leveraging these theoretical insights, we propose a novel coreset construction method by selecting training samples to reconstruct the decision boundary of a deep neural network learned on the full dataset. Extensive experiments across various popular benchmarks demonstrate the superiority of our method over multiple competitors. For the first time, our method achieves a 50% data pruning rate on the ImageNet-1K dataset while sacrificing less than 1% in accuracy. Additionally, we showcase and analyze the remarkable cross-architecture transferability of the coresets derived from our approach.
AB  - Existing paradigms of pushing the state of the art require exponentially more training data in many fields. Coreset selection seeks to mitigate this growing demand by identifying the most efficient subset of training data. In this paper, we delve into geometry-based coreset methods and preliminarily link the geometry of data distribution with models' generalization capability in theoretics. Leveraging these theoretical insights, we propose a novel coreset construction method by selecting training samples to reconstruct the decision boundary of a deep neural network learned on the full dataset. Extensive experiments across various popular benchmarks demonstrate the superiority of our method over multiple competitors. For the first time, our method achieves a 50% data pruning rate on the ImageNet-1K dataset while sacrificing less than 1% in accuracy. Additionally, we showcase and analyze the remarkable cross-architecture transferability of the coresets derived from our approach.
UR  - http://www.scopus.com/inward/record.url?scp=85203798427&partnerID=8YFLogxK
M3  - Conference article
AN  - SCOPUS:85203798427
SN  - 2640-3498
VL  - 235
SP  - 55948
EP  - 55960
JO  - Proceedings of Machine Learning Research
JF  - Proceedings of Machine Learning Research
Y2  - 21 July 2024 through 27 July 2024
ER  -
