TY - GEN
T1 - Efficient Coreset Selection with Cluster-based Methods
AU - Chai, Chengliang
AU - Wang, Jiayi
AU - Tang, Nan
AU - Yuan, Ye
AU - Liu, Jiabin
AU - Deng, Yuhao
AU - Wang, Guoren
N1 - Publisher Copyright: © 2023 ACM.
PY - 2023/8/6
Y1 - 2023/8/6
N2 - Coreset selection is a technique for efficient machine learning that selects a subset of the training data to achieve model performance similar to that of training on the full dataset. It can be performed with or without training machine learning models. Coreset selection with training, which iteratively trains the machine learning model and updates the data items in the coreset, is time-consuming. Coreset selection without training selects the coreset before training begins. Gradient approximation is the typical method of this kind, but it can also be slow on large training datasets because it requires multiple iterations, each with pairwise distance computations. The state-of-the-art (SOTA) results with respect to effectiveness are achieved by the latter approach, i.e., gradient approximation. In this paper, we aim to significantly improve the efficiency of coreset selection while ensuring good effectiveness, by improving the SOTA gradient-based approaches that select the coreset without training machine learning models. Specifically, we present a highly efficient coreset selection framework that uses an approximation of the gradient. This is achieved by dividing the entire training set into multiple clusters, each of which contains items that are close to one another in feature space (measured by Euclidean distance). We further show that the full gradient can be bounded in terms of the maximum feature distance between each item and each cluster, which allows more efficient coreset selection by iterating over clusters rather than individual items. Additionally, we propose an efficient method for estimating the maximum feature distance using product quantization. Our experiments on multiple real-world datasets demonstrate that our approach is 3-10 times more efficient than the SOTA, with almost no loss in accuracy.
KW - coreset selection
KW - data-efficient ml
KW - product quantization
UR - http://www.scopus.com/inward/record.url?scp=85171329343&partnerID=8YFLogxK
U2 - 10.1145/3580305.3599326
DO - 10.1145/3580305.3599326
M3 - Conference contribution
AN - SCOPUS:85171329343
T3 - Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
SP - 167
EP - 178
BT - KDD 2023 - Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining
PB - Association for Computing Machinery
T2 - 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD 2023
Y2 - 6 August 2023 through 10 August 2023
ER -