TY - GEN
T1 - Mitigating Data Scarcity in Supervised Machine Learning Through Reinforcement Learning Guided Data Generation
AU - Chai, Chengliang
AU - Jin, Kasisen
AU - Tang, Nan
AU - Fan, Ju
AU - Qiao, Lianpeng
AU - Wang, Yuping
AU - Luo, Yuyu
AU - Yuan, Ye
AU - Wang, Guoren
N1 - Publisher Copyright:
© 2024 IEEE.
PY - 2024
Y1 - 2024
N2 - One primary problem for supervised ML is data scarcity, i.e., the lack of well-labeled training data. Recently, deep generative models have shown the capability of generating data objects that closely resemble real data across modalities, including images, natural language, and tabular data. Naturally, a promising approach to tackling data scarcity is to train a generative model to produce a collection of data objects and then employ machine-labeling solutions (e.g., weak supervision or semi-supervised learning) to incorporate these generated objects into supervised ML. However, because the provided training data may follow a different distribution than the validation (or unseen test) data, a generative model learned from the seen training data cannot be guaranteed to generate high-quality data for the ML task at hand. To address this challenge, we introduce an iterative approach that gradually calibrates the generative model by interacting with an environment that judges whether generated tuples are good or bad, using a validation dataset that is never exposed to the generative model. In each iteration, we first use a pre-trained generative model to create unlabeled data objects, label them, and integrate the freshly generated data into the learning process. Afterwards, the model is tested in the environment to assess the quality of the generated data. This iterative framework can be naturally controlled using reinforcement learning (RL): an agent generates and labels tuples, and an environment tests the generated tuples and sends rewards back to the agent, progressively enhancing the generative model for a specific supervised ML task. Experimental results on 8 datasets and against multiple baselines demonstrate that our RL-guided data synthesis, together with off-the-shelf semi-automatic labeling solutions, significantly improves the performance of supervised ML models.
AB - One primary problem for supervised ML is data scarcity, i.e., the lack of well-labeled training data. Recently, deep generative models have shown the capability of generating data objects that closely resemble real data across modalities, including images, natural language, and tabular data. Naturally, a promising approach to tackling data scarcity is to train a generative model to produce a collection of data objects and then employ machine-labeling solutions (e.g., weak supervision or semi-supervised learning) to incorporate these generated objects into supervised ML. However, because the provided training data may follow a different distribution than the validation (or unseen test) data, a generative model learned from the seen training data cannot be guaranteed to generate high-quality data for the ML task at hand. To address this challenge, we introduce an iterative approach that gradually calibrates the generative model by interacting with an environment that judges whether generated tuples are good or bad, using a validation dataset that is never exposed to the generative model. In each iteration, we first use a pre-trained generative model to create unlabeled data objects, label them, and integrate the freshly generated data into the learning process. Afterwards, the model is tested in the environment to assess the quality of the generated data. This iterative framework can be naturally controlled using reinforcement learning (RL): an agent generates and labels tuples, and an environment tests the generated tuples and sends rewards back to the agent, progressively enhancing the generative model for a specific supervised ML task. Experimental results on 8 datasets and against multiple baselines demonstrate that our RL-guided data synthesis, together with off-the-shelf semi-automatic labeling solutions, significantly improves the performance of supervised ML models.
UR - http://www.scopus.com/inward/record.url?scp=85200437790&partnerID=8YFLogxK
U2 - 10.1109/ICDE60146.2024.00278
DO - 10.1109/ICDE60146.2024.00278
M3 - Conference contribution
AN - SCOPUS:85200437790
T3 - Proceedings - International Conference on Data Engineering
SP - 3613
EP - 3626
BT - Proceedings - 2024 IEEE 40th International Conference on Data Engineering, ICDE 2024
PB - IEEE Computer Society
T2 - 40th IEEE International Conference on Data Engineering, ICDE 2024
Y2 - 13 May 2024 through 17 May 2024
ER -