TY - GEN
T1 - Demystifying Artificial Intelligence for Data Preparation
AU - Chai, Chengliang
AU - Tang, Nan
AU - Fan, Ju
AU - Luo, Yuyu
N1 - Publisher Copyright: © 2023 ACM.
PY - 2023/6/4
Y1 - 2023/6/4
AB - Data preparation - the process of discovering, integrating, transforming, cleaning, and annotating data - is one of the oldest and hardest, yet unavoidable, data management problems. Unfortunately, data preparation is known to be iterative, labor-intensive, and error-prone. Recent advances in artificial intelligence (AI) have shown very promising results on many data preparation tasks. At a high level, AI for data preparation (AI4DP) should have the following abilities. First, the AI model should capture real-world knowledge so that it can solve various tasks. Second, it should adapt easily to new datasets and tasks. Third, data preparation is a complicated pipeline with many operations, which yields a large number of candidate pipelines from which to select the optimum, so it is crucial to explore this large space effectively and efficiently. In this tutorial, we will cover three important topics to address the above issues: demystifying foundation models to inject knowledge for data preparation, tuning and adapting pre-trained language models for data preparation, and orchestrating data preparation pipelines for different downstream applications.
KW - artificial intelligence
KW - data preparation
KW - foundation models
UR - http://www.scopus.com/inward/record.url?scp=85162854658&partnerID=8YFLogxK
U2 - 10.1145/3555041.3589406
DO - 10.1145/3555041.3589406
M3 - Conference contribution
AN - SCOPUS:85162854658
T3 - Proceedings of the ACM SIGMOD International Conference on Management of Data
SP - 13
EP - 20
BT - SIGMOD 2023 - Companion of the 2023 ACM/SIGMOD International Conference on Management of Data
PB - Association for Computing Machinery
T2 - 2023 ACM/SIGMOD International Conference on Management of Data, SIGMOD 2023
Y2 - 18 June 2023 through 23 June 2023
ER -