TY - JOUR
T1 - SlimML
T2 - Removing Non-Critical Input Data in Large-Scale Iterative Machine Learning
AU - Han, Rui
AU - Liu, Chi Harold
AU - Li, Shilin
AU - Chen, Lydia Y.
AU - Wang, Guoren
AU - Tang, Jian
AU - Ye, Jieping
N1 - Publisher Copyright:
© 1989-2012 IEEE.
PY - 2021/5/1
Y1 - 2021/5/1
N2 - The core of many large-scale machine learning (ML) applications, such as neural networks (NN), support vector machines (SVM), and convolutional neural networks (CNN), is the training algorithm that iteratively updates model parameters by processing massive datasets. Among the plethora of studies aimed at accelerating ML, such as data parallelization and parameter servers, the prevalent assumption is that all data points are equally relevant to model parameter updating. In this article, we challenge this assumption by proposing a criterion to measure a data point's effect on model parameter updating, and experimentally demonstrate that the majority of data points are non-critical in the training process. We develop a slim learning framework, termed SlimML, which trains ML models only on the critical data and thus significantly improves training performance. To this end, SlimML efficiently leverages a small number of aggregated data points per iteration to approximate the criticalness of the original input data instances. The proposed approach can be adopted by changing a few lines of code in a standard stochastic gradient descent (SGD) procedure. We demonstrate experimentally, on NN regression, SVM classification, and CNN training, that for large datasets it accelerates the model training process by an average of 3.61 times while incurring accuracy losses of only 0.37 percent.
AB - The core of many large-scale machine learning (ML) applications, such as neural networks (NN), support vector machines (SVM), and convolutional neural networks (CNN), is the training algorithm that iteratively updates model parameters by processing massive datasets. Among the plethora of studies aimed at accelerating ML, such as data parallelization and parameter servers, the prevalent assumption is that all data points are equally relevant to model parameter updating. In this article, we challenge this assumption by proposing a criterion to measure a data point's effect on model parameter updating, and experimentally demonstrate that the majority of data points are non-critical in the training process. We develop a slim learning framework, termed SlimML, which trains ML models only on the critical data and thus significantly improves training performance. To this end, SlimML efficiently leverages a small number of aggregated data points per iteration to approximate the criticalness of the original input data instances. The proposed approach can be adopted by changing a few lines of code in a standard stochastic gradient descent (SGD) procedure. We demonstrate experimentally, on NN regression, SVM classification, and CNN training, that for large datasets it accelerates the model training process by an average of 3.61 times while incurring accuracy losses of only 0.37 percent.
KW - Iterative machine learning
KW - MapReduce
KW - large input datasets
KW - model parameter updating
UR - http://www.scopus.com/inward/record.url?scp=85101702284&partnerID=8YFLogxK
U2 - 10.1109/TKDE.2019.2951388
DO - 10.1109/TKDE.2019.2951388
M3 - Article
AN - SCOPUS:85101702284
SN - 1041-4347
VL - 33
SP - 2223
EP - 2236
JO - IEEE Transactions on Knowledge and Data Engineering
JF - IEEE Transactions on Knowledge and Data Engineering
IS - 5
M1 - 8890886
ER -