TY - JOUR
T1 - SlimML
T2 - Removing Non-Critical Input Data in Large-Scale Iterative Machine Learning
AU - Han, Rui
AU - Liu, Chi Harold
AU - Li, Shilin
AU - Chen, Lydia Y.
AU - Wang, Guoren
AU - Tang, Jian
AU - Ye, Jieping
N1 - Publisher Copyright:
© 1989-2012 IEEE.
PY - 2021/5/1
Y1 - 2021/5/1
N2 - The core of many large-scale machine learning (ML) applications, such as neural networks (NN), support vector machines (SVM), and convolutional neural networks (CNN), is the training algorithm that iteratively updates model parameters by processing massive datasets. Among the plethora of studies aimed at accelerating ML, such as data parallelization and parameter servers, the prevalent assumption is that all data points are equally relevant to model parameter updating. In this article, we challenge this assumption by proposing a criterion to measure a data point's effect on model parameter updating, and experimentally demonstrate that the majority of data points are non-critical in the training process. We develop a slim learning framework, termed SlimML, which trains ML models only on the critical data and thus significantly improves training performance. To this end, SlimML efficiently leverages a small number of aggregated data points per iteration to approximate the criticalness of the original input data instances. The proposed approach can be adopted by changing a few lines of code in a standard stochastic gradient descent (SGD) procedure. We demonstrate experimentally, on NN regression, SVM classification, and CNN training, that for large datasets it accelerates the model training process by an average of 3.61 times while incurring accuracy losses of only 0.37 percent.
AB - The core of many large-scale machine learning (ML) applications, such as neural networks (NN), support vector machines (SVM), and convolutional neural networks (CNN), is the training algorithm that iteratively updates model parameters by processing massive datasets. Among the plethora of studies aimed at accelerating ML, such as data parallelization and parameter servers, the prevalent assumption is that all data points are equally relevant to model parameter updating. In this article, we challenge this assumption by proposing a criterion to measure a data point's effect on model parameter updating, and experimentally demonstrate that the majority of data points are non-critical in the training process. We develop a slim learning framework, termed SlimML, which trains ML models only on the critical data and thus significantly improves training performance. To this end, SlimML efficiently leverages a small number of aggregated data points per iteration to approximate the criticalness of the original input data instances. The proposed approach can be adopted by changing a few lines of code in a standard stochastic gradient descent (SGD) procedure. We demonstrate experimentally, on NN regression, SVM classification, and CNN training, that for large datasets it accelerates the model training process by an average of 3.61 times while incurring accuracy losses of only 0.37 percent.
KW - Iterative machine learning
KW - MapReduce
KW - large input datasets
KW - model parameter updating
UR - http://www.scopus.com/inward/record.url?scp=85101702284&partnerID=8YFLogxK
U2 - 10.1109/TKDE.2019.2951388
DO - 10.1109/TKDE.2019.2951388
M3 - Article
AN - SCOPUS:85101702284
SN - 1041-4347
VL - 33
SP - 2223
EP - 2236
JO - IEEE Transactions on Knowledge and Data Engineering
JF - IEEE Transactions on Knowledge and Data Engineering
IS - 5
M1 - 8890886
ER -