TY - GEN
T1 - Mitigating Data Scarcity in Supervised Machine Learning Through Reinforcement Learning Guided Data Generation
AU - Chai, Chengliang
AU - Jin, Kasisen
AU - Tang, Nan
AU - Fan, Ju
AU - Qiao, Lianpeng
AU - Wang, Yuping
AU - Luo, Yuyu
AU - Yuan, Ye
AU - Wang, Guoren
N1 - Publisher Copyright:
© 2024 IEEE.
PY - 2024
Y1 - 2024
N2 - One primary problem for supervised ML is data scarcity, i.e., the lack of well-labeled training data. Recently, deep generative models have shown the capability of generating data objects that closely resemble real data across modalities, including images, natural language, and tabular data. Naturally, a promising approach to tackling data scarcity is to train a generative model to produce a collection of data objects and then employ machine-labeling solutions (e.g., weak supervision or semi-supervised learning) to incorporate these generated objects into supervised ML. However, because the provided training data may follow a different distribution than the validation (or unseen test) data, a generative model learned from the seen training data cannot be guaranteed to generate high-quality data for the ML task at hand. To address this challenge, we introduce an iterative approach that gradually calibrates the generative model by interacting with an environment that judges whether generated tuples are good or bad, using a validation dataset that is never exposed to the generative model. In each iteration, we first use a pre-trained generative model to create unlabeled data objects, label them, and integrate the freshly generated data into the learning process. Afterwards, the model is tested in the environment to assess the quality of the generated data. This iterative framework can be naturally controlled using reinforcement learning (RL): an agent generates and labels tuples, and an environment tests the generated tuples and sends rewards back to the agent, progressively enhancing the generative model for a specific supervised ML task. Experimental results on 8 datasets and against multiple baselines demonstrate that our RL-guided data synthesis, together with off-the-shelf semi-automatic labeling solutions, significantly improves the performance of supervised ML models.
AB - One primary problem for supervised ML is data scarcity, i.e., the lack of well-labeled training data. Recently, deep generative models have shown the capability of generating data objects that closely resemble real data across modalities, including images, natural language, and tabular data. Naturally, a promising approach to tackling data scarcity is to train a generative model to produce a collection of data objects and then employ machine-labeling solutions (e.g., weak supervision or semi-supervised learning) to incorporate these generated objects into supervised ML. However, because the provided training data may follow a different distribution than the validation (or unseen test) data, a generative model learned from the seen training data cannot be guaranteed to generate high-quality data for the ML task at hand. To address this challenge, we introduce an iterative approach that gradually calibrates the generative model by interacting with an environment that judges whether generated tuples are good or bad, using a validation dataset that is never exposed to the generative model. In each iteration, we first use a pre-trained generative model to create unlabeled data objects, label them, and integrate the freshly generated data into the learning process. Afterwards, the model is tested in the environment to assess the quality of the generated data. This iterative framework can be naturally controlled using reinforcement learning (RL): an agent generates and labels tuples, and an environment tests the generated tuples and sends rewards back to the agent, progressively enhancing the generative model for a specific supervised ML task. Experimental results on 8 datasets and against multiple baselines demonstrate that our RL-guided data synthesis, together with off-the-shelf semi-automatic labeling solutions, significantly improves the performance of supervised ML models.
UR - http://www.scopus.com/inward/record.url?scp=85200437790&partnerID=8YFLogxK
U2 - 10.1109/ICDE60146.2024.00278
DO - 10.1109/ICDE60146.2024.00278
M3 - Conference contribution
AN - SCOPUS:85200437790
T3 - Proceedings - International Conference on Data Engineering
SP - 3613
EP - 3626
BT - Proceedings - 2024 IEEE 40th International Conference on Data Engineering, ICDE 2024
PB - IEEE Computer Society
T2 - 40th IEEE International Conference on Data Engineering, ICDE 2024
Y2 - 13 May 2024 through 17 May 2024
ER -