Efficient Coreset Selection with Cluster-based Methods

Chengliang Chai; Jiayi Wang; Nan Tang; Ye Yuan; Jiabin Liu; Yuhao Deng; Guoren Wang

doi:10.1145/3580305.3599326

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Peer-reviewed

9 citations (Scopus)

Abstract

Coreset selection is a technique for efficient machine learning that selects a subset of the training data on which a model achieves performance similar to training on the full dataset. It can be performed with or without training machine learning models. Coreset selection with training, which iteratively trains the machine learning model and updates the data items in the coreset, is time-consuming. Coreset selection without training selects the coreset before training begins. Gradient approximation is the typical method of this kind, but it can also be slow on large training datasets because it requires multiple iterations, each with pairwise distance computations. The state-of-the-art (SOTA) results with respect to effectiveness are achieved by the latter approach, i.e., gradient approximation. In this paper, we aim to significantly improve the efficiency of coreset selection while preserving good effectiveness, by improving the SOTA gradient-based approaches that select the coreset without training machine learning models. Specifically, we present a highly efficient coreset selection framework that uses an approximation of the gradient. This is achieved by dividing the entire training set into multiple clusters, each of which contains items that are close in feature space (measured by Euclidean distance). We further show that the full gradient can be bounded using the maximum feature distance between each item and each cluster, which allows more efficient coreset selection by iterating through these clusters. Additionally, we propose an efficient method for estimating the maximum feature distance using product quantization. Experiments on multiple real-world datasets show that our approach is 3-10 times more efficient than the SOTA with almost no loss in accuracy.
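
The abstract mentions estimating feature distances with product quantization. As a rough illustration of that general technique only (not the authors' implementation), the sketch below trains a small product quantizer with NumPy and uses lookup tables to approximate squared Euclidean distances from a query to every encoded item. All function names (kmeans, pq_train, pq_distances) and parameters (n_sub, n_codes) are hypothetical, and the feature dimension is assumed to be divisible by n_sub.

```python
import numpy as np


def kmeans(x, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(iters):
        # Squared Euclidean distance from every point to every centroid.
        d2 = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            members = x[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    # Final assignment against the final centroids.
    d2 = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return centroids, d2.argmin(axis=1)


def pq_train(x, n_sub, n_codes, seed=0):
    """Learn one codebook per subspace and encode every row of x."""
    assert x.shape[1] % n_sub == 0, "feature dimension must divide evenly"
    sub_dim = x.shape[1] // n_sub
    codebooks, codes = [], []
    for m in range(n_sub):
        block = x[:, m * sub_dim:(m + 1) * sub_dim]
        centroids, labels = kmeans(block, n_codes, seed=seed + m)
        codebooks.append(centroids)
        codes.append(labels)
    # codebooks: (n_sub, n_codes, sub_dim); codes: (n_items, n_sub)
    return np.stack(codebooks), np.stack(codes, axis=1)


def pq_distances(query, codebooks, codes):
    """Approximate squared distances from query to all encoded items."""
    n_sub, n_codes, sub_dim = codebooks.shape
    q_blocks = query.reshape(n_sub, 1, sub_dim)
    # table[m, c] = squared distance between query block m and codeword c.
    table = ((q_blocks - codebooks) ** 2).sum(axis=2)
    # Sum the per-subspace table entries selected by each item's codes.
    return table[np.arange(n_sub), codes].sum(axis=1)


# Synthetic features stand in for real training data (an assumption).
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 8))
codebooks, codes = pq_train(features, n_sub=4, n_codes=16)
approx_sq_dist = pq_distances(features[0], codebooks, codes)
```

Each approximate distance costs n_sub table lookups instead of a full pass over the feature dimensions, which is where the speed-up of this family of methods comes from; the price is the quantization error introduced by replacing each item with its nearest codewords.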

Original language	English
Title of host publication	KDD 2023 - Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining
Publisher	Association for Computing Machinery
Pages	167-178
Number of pages	12
ISBN (Electronic)	9798400701030
DOI	https://doi.org/10.1145/3580305.3599326
Publication status	Published - 4 Aug 2023
Event	29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD 2023 - Long Beach, United States. Duration: 6 Aug 2023 → 10 Aug 2023

Publication series

Name	Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
ISSN (Print)	2154-817X

Conference

Conference	29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD 2023
Country/Territory	United States
City	Long Beach
Period	6/08/23 → 10/08/23

Access to Document

10.1145/3580305.3599326

Cite this

Chai, C., Wang, J., Tang, N., Yuan, Y., Liu, J., Deng, Y., & Wang, G. (2023). Efficient Coreset Selection with Cluster-based Methods. In KDD 2023 - Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (pp. 167-178). (Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining). Association for Computing Machinery. https://doi.org/10.1145/3580305.3599326

@inproceedings{d0fe3a0f01824d9292d2e84fbcb69f4b,

title = "Efficient Coreset Selection with Cluster-based Methods",

abstract = "Coreset selection is a technique for efficient machine learning, which selects a subset of the training data to achieve similar model performance as using the full dataset. It can be performed with or without training machine learning models. Coreset selection with training, which iteratively trains the machine model and updates data items in the coreset, is time consuming. Coreset selection without training can select the coreset before training. Gradient approximation is the typical method, but it can also be slow when dealing with large training datasets as it requires multiple iterations and pairwise distance computations for each iteration. The state-of-the-art (SOTA) results w.r.t. effectiveness are achieved by the latter approach, i.e. gradient approximation. In this paper, we aim to significantly improve the efficiency of coreset selection while ensuring good effectiveness, by improving the SOTA approaches of using gradient descent without training machine learning models. Specifically, we present a highly efficient coreset selection framework that utilizes an approximation of the gradient. This is achieved by dividing the entire training set into multiple clusters, each of which contains items with similar feature distances (calculated using the Euclidean distance). Our framework further demonstrates that the full gradient can be bounded based on the maximum feature distance between each item and each cluster, allowing for more efficient coreset selection by iterating through these clusters. Additionally, we propose an efficient method for estimating the maximum feature distance using the product quantization technique. Our experiments on multiple real-world datasets demonstrate that we can improve the efficiency 3-10 times comparing with SOTA almost without sacrificing the accuracy.",

keywords = "coreset selection, data-efficient ml, product quantization",

author = "Chengliang Chai and Jiayi Wang and Nan Tang and Ye Yuan and Jiabin Liu and Yuhao Deng and Guoren Wang",

note = "Publisher Copyright: {\textcopyright} 2023 ACM.; 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD 2023 ; Conference date: 06-08-2023 Through 10-08-2023",

year = "2023",

month = aug,

day = "4",

doi = "10.1145/3580305.3599326",

language = "English",

series = "Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining",

publisher = "Association for Computing Machinery",

pages = "167--178",

booktitle = "KDD 2023 - Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining",

}

Chai, C, Wang, J, Tang, N, Yuan, Y, Liu, J, Deng, Y & Wang, G 2023, Efficient Coreset Selection with Cluster-based Methods. In KDD 2023 - Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Association for Computing Machinery, pp. 167-178, 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD 2023, Long Beach, United States, 6/08/23. https://doi.org/10.1145/3580305.3599326

Efficient Coreset Selection with Cluster-based Methods. / Chai, Chengliang; Wang, Jiayi; Tang, Nan et al.
KDD 2023 - Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. Association for Computing Machinery, 2023. pp. 167-178 (Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining).

TY - GEN

T1 - Efficient Coreset Selection with Cluster-based Methods

AU - Chai, Chengliang

AU - Wang, Jiayi

AU - Tang, Nan

AU - Yuan, Ye

AU - Liu, Jiabin

AU - Deng, Yuhao

AU - Wang, Guoren

PY - 2023/8/4

Y1 - 2023/8/4

N2 - Coreset selection is a technique for efficient machine learning, which selects a subset of the training data to achieve similar model performance as using the full dataset. It can be performed with or without training machine learning models. Coreset selection with training, which iteratively trains the machine model and updates data items in the coreset, is time consuming. Coreset selection without training can select the coreset before training. Gradient approximation is the typical method, but it can also be slow when dealing with large training datasets as it requires multiple iterations and pairwise distance computations for each iteration. The state-of-the-art (SOTA) results w.r.t. effectiveness are achieved by the latter approach, i.e. gradient approximation. In this paper, we aim to significantly improve the efficiency of coreset selection while ensuring good effectiveness, by improving the SOTA approaches of using gradient descent without training machine learning models. Specifically, we present a highly efficient coreset selection framework that utilizes an approximation of the gradient. This is achieved by dividing the entire training set into multiple clusters, each of which contains items with similar feature distances (calculated using the Euclidean distance). Our framework further demonstrates that the full gradient can be bounded based on the maximum feature distance between each item and each cluster, allowing for more efficient coreset selection by iterating through these clusters. Additionally, we propose an efficient method for estimating the maximum feature distance using the product quantization technique. Our experiments on multiple real-world datasets demonstrate that we can improve the efficiency 3-10 times comparing with SOTA almost without sacrificing the accuracy.

AB - Coreset selection is a technique for efficient machine learning, which selects a subset of the training data to achieve similar model performance as using the full dataset. It can be performed with or without training machine learning models. Coreset selection with training, which iteratively trains the machine model and updates data items in the coreset, is time consuming. Coreset selection without training can select the coreset before training. Gradient approximation is the typical method, but it can also be slow when dealing with large training datasets as it requires multiple iterations and pairwise distance computations for each iteration. The state-of-the-art (SOTA) results w.r.t. effectiveness are achieved by the latter approach, i.e. gradient approximation. In this paper, we aim to significantly improve the efficiency of coreset selection while ensuring good effectiveness, by improving the SOTA approaches of using gradient descent without training machine learning models. Specifically, we present a highly efficient coreset selection framework that utilizes an approximation of the gradient. This is achieved by dividing the entire training set into multiple clusters, each of which contains items with similar feature distances (calculated using the Euclidean distance). Our framework further demonstrates that the full gradient can be bounded based on the maximum feature distance between each item and each cluster, allowing for more efficient coreset selection by iterating through these clusters. Additionally, we propose an efficient method for estimating the maximum feature distance using the product quantization technique. Our experiments on multiple real-world datasets demonstrate that we can improve the efficiency 3-10 times comparing with SOTA almost without sacrificing the accuracy.

KW - coreset selection

KW - data-efficient ml

KW - product quantization

UR - http://www.scopus.com/inward/record.url?scp=85171329343&partnerID=8YFLogxK

U2 - 10.1145/3580305.3599326

DO - 10.1145/3580305.3599326

M3 - Conference contribution

AN - SCOPUS:85171329343

T3 - Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

SP - 167

EP - 178

BT - KDD 2023 - Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining

PB - Association for Computing Machinery

T2 - 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD 2023

Y2 - 6 August 2023 through 10 August 2023

ER -

Chai C, Wang J, Tang N, Yuan Y, Liu J, Deng Y et al. Efficient Coreset Selection with Cluster-based Methods. In KDD 2023 - Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. Association for Computing Machinery. 2023. pp. 167-178. (Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining). doi: 10.1145/3580305.3599326
