Demystifying Artificial Intelligence for Data Preparation

Chengliang Chai; Nan Tang; Ju Fan; Yuyu Luo

doi:10.1145/3555041.3589406

School of Computer Science and Technology

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

3 Citations (Scopus)

Abstract

Data preparation - the process of discovering, integrating, transforming, cleaning, and annotating data - is one of the oldest, hardest, yet inevitable data management problems. Unfortunately, data preparation is known to be iterative, requires high human cost, and is error-prone. Recent advances in artificial intelligence (AI) have shown very promising results on many data preparation tasks. At a high level, AI for data preparation (AI4DP) should have the following abilities. First, the AI model should capture real-world knowledge so as to solve various tasks. Second, it is important to easily adapt to new datasets/tasks. Third, data preparation is a complicated pipeline with many operations, which results in a large number of candidates to select the optimum, and thus it is crucial to effectively and efficiently explore the large space of possible pipelines. In this tutorial, we will cover three important topics to address the above issues: demystifying foundation models to inject knowledge for data preparation, tuning and adapting pre-trained language models for data preparation, and orchestrating data preparation pipelines for different downstream applications.

Original language	English
Title of host publication	SIGMOD 2023 - Companion of the 2023 ACM/SIGMOD International Conference on Management of Data
Publisher	Association for Computing Machinery
Pages	13-20
Number of pages	8
ISBN (Electronic)	9781450395076
DOIs	https://doi.org/10.1145/3555041.3589406
Publication status	Published - 4 Jun 2023
Event	2023 ACM/SIGMOD International Conference on Management of Data, SIGMOD 2023 - Seattle, United States
Duration	18 Jun 2023 → 23 Jun 2023

Publication series

Name	Proceedings of the ACM SIGMOD International Conference on Management of Data
ISSN (Print)	0730-8078

Conference

Conference	2023 ACM/SIGMOD International Conference on Management of Data, SIGMOD 2023
Country/Territory	United States
City	Seattle
Period	18/06/23 → 23/06/23

Keywords

artificial intelligence
data preparation
foundation models

Access to Document

10.1145/3555041.3589406

Cite this

Chai, C., Tang, N., Fan, J., & Luo, Y. (2023). Demystifying Artificial Intelligence for Data Preparation. In SIGMOD 2023 - Companion of the 2023 ACM/SIGMOD International Conference on Management of Data (pp. 13-20). (Proceedings of the ACM SIGMOD International Conference on Management of Data). Association for Computing Machinery. https://doi.org/10.1145/3555041.3589406

@inproceedings{1b9db9e61d12443899482157837540ae,

title = "Demystifying Artificial Intelligence for Data Preparation",

abstract = "Data preparation - the process of discovering, integrating, transforming, cleaning, and annotating data - is one of the oldest, hardest, yet inevitable data management problems. Unfortunately, data preparation is known to be iterative, requires high human cost, and is error-prone. Recent advances in artificial intelligence (AI) have shown very promising results on many data preparation tasks. At a high level, AI for data preparation (AI4DP) should have the following abilities. First, the AI model should capture real-world knowledge so as to solve various tasks. Second, it is important to easily adapt to new datasets/tasks. Third, data preparation is a complicated pipeline with many operations, which results in a large number of candidates to select the optimum, and thus it is crucial to effectively and efficiently explore the large space of possible pipelines. In this tutorial, we will cover three important topics to address the above issues: demystifying foundation models to inject knowledge for data preparation, tuning and adapting pre-trained language models for data preparation, and orchestrating data preparation pipelines for different downstream applications.",

keywords = "artificial intelligence, data preparation, foundation models",

author = "Chengliang Chai and Nan Tang and Ju Fan and Yuyu Luo",

note = "Publisher Copyright: {\textcopyright} 2023 ACM.; 2023 ACM/SIGMOD International Conference on Management of Data, SIGMOD 2023 ; Conference date: 18-06-2023 Through 23-06-2023",

year = "2023",

month = jun,

day = "4",

doi = "10.1145/3555041.3589406",

language = "English",

series = "Proceedings of the ACM SIGMOD International Conference on Management of Data",

publisher = "Association for Computing Machinery",

pages = "13--20",

booktitle = "SIGMOD 2023 - Companion of the 2023 ACM/SIGMOD International Conference on Management of Data",

}

Chai, C, Tang, N, Fan, J & Luo, Y 2023, Demystifying Artificial Intelligence for Data Preparation. in SIGMOD 2023 - Companion of the 2023 ACM/SIGMOD International Conference on Management of Data. Proceedings of the ACM SIGMOD International Conference on Management of Data, Association for Computing Machinery, pp. 13-20, 2023 ACM/SIGMOD International Conference on Management of Data, SIGMOD 2023, Seattle, United States, 18/06/23. https://doi.org/10.1145/3555041.3589406

Demystifying Artificial Intelligence for Data Preparation. / Chai, Chengliang; Tang, Nan; Fan, Ju et al.
SIGMOD 2023 - Companion of the 2023 ACM/SIGMOD International Conference on Management of Data. Association for Computing Machinery, 2023. p. 13-20 (Proceedings of the ACM SIGMOD International Conference on Management of Data).

TY - GEN

T1 - Demystifying Artificial Intelligence for Data Preparation

AU - Chai, Chengliang

AU - Tang, Nan

AU - Fan, Ju

AU - Luo, Yuyu

PY - 2023/6/4

Y1 - 2023/6/4

N2 - Data preparation - the process of discovering, integrating, transforming, cleaning, and annotating data - is one of the oldest, hardest, yet inevitable data management problems. Unfortunately, data preparation is known to be iterative, requires high human cost, and is error-prone. Recent advances in artificial intelligence (AI) have shown very promising results on many data preparation tasks. At a high level, AI for data preparation (AI4DP) should have the following abilities. First, the AI model should capture real-world knowledge so as to solve various tasks. Second, it is important to easily adapt to new datasets/tasks. Third, data preparation is a complicated pipeline with many operations, which results in a large number of candidates to select the optimum, and thus it is crucial to effectively and efficiently explore the large space of possible pipelines. In this tutorial, we will cover three important topics to address the above issues: demystifying foundation models to inject knowledge for data preparation, tuning and adapting pre-trained language models for data preparation, and orchestrating data preparation pipelines for different downstream applications.

KW - artificial intelligence

KW - data preparation

KW - foundation models

UR - http://www.scopus.com/inward/record.url?scp=85162854658&partnerID=8YFLogxK

U2 - 10.1145/3555041.3589406

DO - 10.1145/3555041.3589406

M3 - Conference contribution

AN - SCOPUS:85162854658

T3 - Proceedings of the ACM SIGMOD International Conference on Management of Data

SP - 13

EP - 20

BT - SIGMOD 2023 - Companion of the 2023 ACM/SIGMOD International Conference on Management of Data

PB - Association for Computing Machinery

T2 - 2023 ACM/SIGMOD International Conference on Management of Data, SIGMOD 2023

Y2 - 18 June 2023 through 23 June 2023

ER -
