Data Management for Machine Learning: A Survey

Chengliang Chai; Jiayi Wang; Yuyu Luo; Zeping Niu; Guoliang Li

doi:10.1109/TKDE.2022.3148237

Chengliang Chai, Jiayi Wang, Yuyu Luo^*, Zeping Niu, Guoliang Li^*

^*Corresponding author for this work

Tsinghua University

Research output: Contribution to journal › Article › peer-review

39 Citations (Scopus)

Abstract

Machine learning (ML) has widespread applications and has revolutionized many industries, but it faces several challenges. First, sufficient high-quality training data is indispensable for producing a well-performing model, but such data is expensive to acquire because it requires substantial human effort. Second, large volumes of training data and complicated model structures make training and inference inefficient. Third, a given ML task often requires training many models, which are hard to manage in real applications. Fortunately, database techniques can benefit ML by addressing these three challenges. In this paper, we review existing studies from three aspects that follow the ML pipeline. (1) Data preparation (Pre-ML): this stage focuses on preparing high-quality training data that can improve the performance of the ML model; here we review data discovery, data cleaning and data labeling. (2) Model training & inference (In-ML): researchers in the ML community focus on improving model performance during training, whereas this survey mainly studies how to accelerate the entire training process, including feature selection and model selection. (3) Model management (Post-ML): here we survey how to store, query, deploy and debug models after training. Finally, we discuss research challenges and future directions.

Original language	English
Pages (from-to)	4646-4667
Number of pages	22
Journal	IEEE Transactions on Knowledge and Data Engineering
Volume	35
Issue number	5
DOIs	https://doi.org/10.1109/TKDE.2022.3148237
Publication status	Published - 1 May 2023
Externally published	Yes

Keywords

Database
data preparation
machine learning
model inference
model training

Access to Document

10.1109/TKDE.2022.3148237

Cite this

Chai, C., Wang, J., Luo, Y., Niu, Z., & Li, G. (2023). Data Management for Machine Learning: A Survey. IEEE Transactions on Knowledge and Data Engineering, 35(5), 4646-4667. https://doi.org/10.1109/TKDE.2022.3148237

@article{404f435882ba4356bf2378d8880c0001,

title = "Data Management for Machine Learning: A Survey",

abstract = "Machine learning (ML) has widespread applications and has revolutionized many industries, but suffers from several challenges. First, sufficient high-quality training data is inevitable for producing a well-performed model, but the data is always human expensive to acquire. Second, a large amount of training data and complicated model structures lead to the inefficiency of training and inference. Third, given an ML task, one always needs to train lots of models, which are hard to manage in real applications. Fortunately, database techniques can benefit ML by addressing the above three challenges. In this paper, we review existing studies from the following three aspects along with the pipeline highly related to ML. (1) Data preparation (Pre-ML): it focuses on preparing high-quality training data that can improve the performance of the ML model, where we review data discovery, data cleaning and data labeling. (2) Model training & inference (In-ML): researchers in ML community focus on improving the model performance during training, while in this survey we mainly study how to accelerate the entire training process, also including feature selection and model selection. (3) Model management (Post-ML): in this part, we survey how to store, query, deploy and debug the models after training. Finally, we provide research challenges and future directions.",

keywords = "Database, data preparation, machine learning, model inference, model training",

author = "Chengliang Chai and Jiayi Wang and Yuyu Luo and Zeping Niu and Guoliang Li",

note = "Publisher Copyright: {\textcopyright} 1989-2012 IEEE.",

year = "2023",

month = may,

day = "1",

doi = "10.1109/TKDE.2022.3148237",

language = "English",

volume = "35",

pages = "4646--4667",

journal = "IEEE Transactions on Knowledge and Data Engineering",

issn = "1041-4347",

publisher = "IEEE Computer Society",

number = "5",

}

TY - JOUR

T1 - Data Management for Machine Learning

T2 - A Survey

AU - Chai, Chengliang

AU - Wang, Jiayi

AU - Luo, Yuyu

AU - Niu, Zeping

AU - Li, Guoliang

PY - 2023/5/1

Y1 - 2023/5/1

N2 - Machine learning (ML) has widespread applications and has revolutionized many industries, but suffers from several challenges. First, sufficient high-quality training data is inevitable for producing a well-performed model, but the data is always human expensive to acquire. Second, a large amount of training data and complicated model structures lead to the inefficiency of training and inference. Third, given an ML task, one always needs to train lots of models, which are hard to manage in real applications. Fortunately, database techniques can benefit ML by addressing the above three challenges. In this paper, we review existing studies from the following three aspects along with the pipeline highly related to ML. (1) Data preparation (Pre-ML): it focuses on preparing high-quality training data that can improve the performance of the ML model, where we review data discovery, data cleaning and data labeling. (2) Model training & inference (In-ML): researchers in ML community focus on improving the model performance during training, while in this survey we mainly study how to accelerate the entire training process, also including feature selection and model selection. (3) Model management (Post-ML): in this part, we survey how to store, query, deploy and debug the models after training. Finally, we provide research challenges and future directions.

KW - Database

KW - data preparation

KW - machine learning

KW - model inference

KW - model training

UR - http://www.scopus.com/inward/record.url?scp=85124237590&partnerID=8YFLogxK

U2 - 10.1109/TKDE.2022.3148237

DO - 10.1109/TKDE.2022.3148237

M3 - Article

AN - SCOPUS:85124237590

SN - 1041-4347

VL - 35

SP - 4646

EP - 4667

JO - IEEE Transactions on Knowledge and Data Engineering

JF - IEEE Transactions on Knowledge and Data Engineering

IS - 5

ER -
