Selective Data Acquisition in the Wild for Model Charging

Chengliang Chai, Jiabin Liu, Nan Tang, Guoliang Li, Yuyu Luo

Research output: Contribution to journal, conference article, peer-reviewed

Abstract

The lack of sufficient labeled data is a key bottleneck for practitioners in many real-world supervised machine learning (ML) tasks. In this paper, we study a new problem, namely selective data acquisition in the wild for model charging: given a supervised ML task and data in the wild (e.g., enterprise data warehouses, online data repositories, data markets, and so on), the problem is to select labeled data points from the data in the wild as additional training data that can help the ML task. It consists of two steps (Fig. 1). The first step is to discover relevant datasets (e.g., tables with similar relational schema), which will result in a set of candidate datasets. Because these candidate datasets come from different sources and may follow different distributions, not all data points they contain can help. The second step is to select which data points from these candidate datasets should be used. We build an end-to-end solution. For step 1, we piggyback on off-the-shelf data discovery tools. Technically, our focus is on step 2, for which we propose a solution framework called AutoData. It first clusters all data points from candidate datasets such that each cluster contains similar data points from different sources. It then iteratively picks which cluster to use, samples data points (i.e., a mini-batch) from the picked cluster, evaluates the mini-batch, and then revises the search criteria by learning from the feedback (i.e., reward) based on the evaluation. We propose a multi-armed bandit-based solution and a Deep Q-Network-based reinforcement learning solution. Experiments using both relational and image datasets show the effectiveness of our solutions.
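
The pick, sample, evaluate, and update loop described in the abstract can be illustrated with a small sketch. This is not the AutoData implementation: the abstract does not say which bandit strategy is used, so UCB1 is assumed here, and the names `bandit_select`, `evaluate_batch`, and `toy_reward` are hypothetical. The toy reward stands in for whatever evaluation the ML task would actually use, and the clusters are synthetic one-dimensional points rather than the output of a real clustering step.

```python
import math
import random


def bandit_select(clusters, evaluate_batch, rounds=50, batch_size=4, seed=0):
    """Pick mini-batches from clusters with UCB1 (an assumed bandit strategy).

    clusters: list of lists of data points, one list per cluster.
    evaluate_batch: hypothetical callable (selected_so_far, batch) -> reward in [0, 1].
    Returns (selected points, number of times each cluster was picked).
    """
    rng = random.Random(seed)
    pools = [list(c) for c in clusters]
    for pool in pools:
        # After shuffling, popping from the end samples without replacement.
        rng.shuffle(pool)
    counts = [0] * len(pools)
    mean_reward = [0.0] * len(pools)
    selected = []

    for t in range(1, rounds + 1):
        live = [i for i, pool in enumerate(pools) if pool]
        if not live:
            break  # every cluster is exhausted
        untried = [i for i in live if counts[i] == 0]
        if untried:
            arm = untried[0]  # try each cluster once before using UCB scores
        else:
            arm = max(
                live,
                key=lambda i: mean_reward[i]
                + math.sqrt(2.0 * math.log(t) / counts[i]),
            )
        k = min(batch_size, len(pools[arm]))
        batch = [pools[arm].pop() for _ in range(k)]
        reward = evaluate_batch(selected, batch)
        # Incremental update of the cluster's average reward (the feedback step).
        counts[arm] += 1
        mean_reward[arm] += (reward - mean_reward[arm]) / counts[arm]
        if reward > 0:
            selected.extend(batch)  # keep only batches judged helpful
    return selected, counts


def toy_reward(selected, batch):
    # Toy stand-in for model evaluation: fraction of the batch that lies
    # within distance 1 of an assumed target mean of 0.0.
    return sum(abs(x) < 1.0 for x in batch) / len(batch)


# Three synthetic clusters; only the first matches the assumed target distribution.
data_rng = random.Random(42)
clusters = [[data_rng.gauss(mu, 0.5) for _ in range(200)] for mu in (0.0, 5.0, -5.0)]
selected, counts = bandit_select(clusters, toy_reward)
print(counts)
```

In this toy setting the cluster centered on the target is picked far more often than the other two, which is the behavior the feedback loop is meant to produce. A realistic reward would more plausibly measure the change in a model's validation performance after adding the mini-batch, at a much higher cost per round.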

Original language: English
Pages (from-to): 1466-1478
Number of pages: 13
Journal: Contemporary Mathematics
Volume: 15
Issue number: 7
Publication status: Published - 2022
Externally published: Yes
Event: 48th International Conference on Very Large Data Bases, VLDB 2022 - Sydney, Australia
Duration: 5 Sep 2022 to 9 Sep 2022
