TY - JOUR
T1 - Data Management for Machine Learning
T2 - A Survey
AU - Chai, Chengliang
AU - Wang, Jiayi
AU - Luo, Yuyu
AU - Niu, Zeping
AU - Li, Guoliang
N1 - Publisher Copyright: © 1989-2012 IEEE.
PY - 2023/5/1
Y1 - 2023/5/1
N2 - Machine learning (ML) has widespread applications and has revolutionized many industries, but it faces several challenges. First, sufficient high-quality training data is indispensable for producing a well-performing model, but such data is expensive to acquire in terms of human effort. Second, large amounts of training data and complicated model structures make training and inference inefficient. Third, a given ML task often requires training many models, which are hard to manage in real applications. Fortunately, database techniques can benefit ML by addressing these three challenges. In this paper, we review existing studies from three aspects that follow the ML pipeline. (1) Data preparation (Pre-ML): this stage focuses on preparing high-quality training data that can improve the performance of the ML model; here we review data discovery, data cleaning and data labeling. (2) Model training & inference (In-ML): whereas researchers in the ML community focus on improving model performance during training, in this survey we mainly study how to accelerate the entire training process, including feature selection and model selection. (3) Model management (Post-ML): in this part, we survey how to store, query, deploy and debug models after training. Finally, we present research challenges and future directions.
KW - Database
KW - data preparation
KW - machine learning
KW - model inference
KW - model training
UR - http://www.scopus.com/inward/record.url?scp=85124237590&partnerID=8YFLogxK
U2 - 10.1109/TKDE.2022.3148237
DO - 10.1109/TKDE.2022.3148237
M3 - Article
AN - SCOPUS:85124237590
SN - 1041-4347
VL - 35
SP - 4646
EP - 4667
JO - IEEE Transactions on Knowledge and Data Engineering
JF - IEEE Transactions on Knowledge and Data Engineering
IS - 5
ER -