Selective Data Acquisition in the Wild for Model Charging

Chengliang Chai; Jiabin Liu; Nan Tang; Guoliang Li; Yuyu Luo

doi:10.14778/3523210.3523223

Research output: Contribution to journal › Conference article › peer-review

33 Citations (Scopus)

Abstract

The lack of sufficient labeled data is a key bottleneck for practitioners in many real-world supervised machine learning (ML) tasks. In this paper, we study a new problem, namely selective data acquisition in the wild for model charging: given a supervised ML task and data in the wild (e.g., enterprise data warehouses, online data repositories, data markets, and so on), the problem is to select labeled data points from the data in the wild as additional training data that can help the ML task. It consists of two steps (Fig. 1). The first step is to discover relevant datasets (e.g., tables with similar relational schema), which will result in a set of candidate datasets. Because these candidate datasets come from different sources and may follow different distributions, not all data points they contain can help. The second step is to select which data points from these candidate datasets should be used. We build an end-to-end solution. For step 1, we piggyback on off-the-shelf data discovery tools. Technically, our focus is on step 2, for which we propose a solution framework called AutoData. It first clusters all data points from candidate datasets such that each cluster contains similar data points from different sources. It then iteratively picks which cluster to use, samples data points (i.e., a mini-batch) from the picked cluster, evaluates the mini-batch, and then revises the search criteria by learning from the feedback (i.e., reward) based on the evaluation. We propose a multi-armed bandit-based solution and a Deep Q-Network-based reinforcement learning solution. Experiments using both relational and image datasets show the effectiveness of our solutions.
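
The iterative loop described in the abstract (pick a cluster, sample a mini-batch, evaluate it, learn from the reward) can be illustrated with a small sketch. This is not the authors' AutoData implementation: it assumes a standard UCB1 bandit in which each cluster is an arm, and the names `ucb1_select`, `acquire`, and the `evaluate` callback are hypothetical. Keeping a mini-batch only when its reward is positive is likewise an assumption made for illustration.

```python
import math
import random


def ucb1_select(counts, totals, t, available):
    """Pick an arm by UCB1: try each unplayed arm once, then maximize mean + bonus."""
    for arm in available:
        if counts[arm] == 0:
            return arm
    return max(
        available,
        key=lambda a: totals[a] / counts[a] + math.sqrt(2 * math.log(t) / counts[a]),
    )


def acquire(clusters, evaluate, rounds, batch_size, seed=0):
    """Iteratively pick a cluster, draw a mini-batch from it, and learn from the reward.

    `evaluate` is a hypothetical callback that scores a mini-batch, e.g. the
    change in validation accuracy after training on it (an assumption here).
    """
    rng = random.Random(seed)
    pools = [list(cluster) for cluster in clusters]
    for pool in pools:
        rng.shuffle(pool)  # popping from a shuffled pool samples without replacement
    counts = [0] * len(pools)
    totals = [0.0] * len(pools)
    selected = []
    for t in range(1, rounds + 1):
        available = [a for a, pool in enumerate(pools) if pool]
        if not available:
            break  # every cluster has been exhausted
        arm = ucb1_select(counts, totals, t, available)
        size = min(batch_size, len(pools[arm]))
        batch = [pools[arm].pop() for _ in range(size)]
        reward = evaluate(batch)
        counts[arm] += 1
        totals[arm] += reward
        if reward > 0:  # assumed rule: keep only batches that helped
            selected.extend(batch)
    return counts, selected


# Toy data: cluster 0 holds helpful points (1), cluster 1 unhelpful ones (0).
toy_clusters = [[1] * 100, [0] * 100]
toy_counts, toy_selected = acquire(
    toy_clusters, lambda batch: sum(batch) / len(batch), rounds=20, batch_size=4
)
```

In this toy run the bandit spends most of its rounds on the helpful cluster while still occasionally exploring the other one, which is the exploration and exploitation trade-off that a bandit formulation is meant to handle.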

Original language	English
Pages (from-to)	1466-1478
Number of pages	13
Journal	Proceedings of the VLDB Endowment
Volume	15
Issue number	7
DOIs	https://doi.org/10.14778/3523210.3523223
Publication status	Published - 2022
Externally published	Yes
Event	48th International Conference on Very Large Data Bases, VLDB 2022, Sydney, Australia, 5 Sept 2022 → 9 Sept 2022

Cite this

Chai, C., Liu, J., Tang, N., Li, G., & Luo, Y. (2022). Selective Data Acquisition in the Wild for Model Charging. Proceedings of the VLDB Endowment, 15(7), 1466-1478. https://doi.org/10.14778/3523210.3523223

@article{7d19681b3d1a49c39bf90a65604b5395,

title = "Selective Data Acquisition in the Wild for Model Charging",

abstract = "The lack of sufficient labeled data is a key bottleneck for practitioners in many real-world supervised machine learning (ML) tasks. In this paper, we study a new problem, namely selective data acquisition in the wild for model charging: given a supervised ML task and data in the wild (e.g., enterprise data warehouses, online data repositories, data markets, and so on), the problem is to select labeled data points from the data in the wild as additional train data that can help the ML task. It consists of two steps (Fig. 1). The first step is to discover relevant datasets (e.g., tables with similar relational schema), which will result in a set of candidate datasets. Because these candidate datasets come from different sources and may follow different distributions, not all data points they contain can help. The second step is to select which data points from these candidate datasets should be used. We build an end-to-end solution. For step 1, we piggyback off-the-shelf data discovery tools. Technically, our focus is on step 2, for which we propose a solution framework called AutoData. It first clusters all data points from candidate datasets such that each cluster contains similar data points from different sources. It then iteratively picks which cluster to use, samples data points (i.e., a mini-batch) from the picked cluster, evaluates the mini-batch, and then revises the search criteria by learning from the feedback (i.e., reward) based on the evaluation. We propose a multi-armed bandit based solution and a Deep Q Networks-based reinforcement learning solution. Experiments using both relational and image datasets show the effectiveness of our solutions.",

author = "Chengliang Chai and Jiabin Liu and Nan Tang and Guoliang Li and Yuyu Luo",

note = "Publisher Copyright: {\textcopyright} 2022, American Mathematical Society. All rights reserved.; 48th International Conference on Very Large Data Bases, VLDB 2022 ; Conference date: 05-09-2022 Through 09-09-2022",

year = "2022",

doi = "10.14778/3523210.3523223",

language = "English",

volume = "15",

pages = "1466--1478",

journal = "Proceedings of the VLDB Endowment",

issn = "2150-8097",

publisher = "Very Large Data Base Endowment Inc.",

number = "7",

}

TY - JOUR

T1 - Selective Data Acquisition in the Wild for Model Charging

AU - Chai, Chengliang

AU - Liu, Jiabin

AU - Tang, Nan

AU - Li, Guoliang

AU - Luo, Yuyu

PY - 2022

Y1 - 2022

N2 - The lack of sufficient labeled data is a key bottleneck for practitioners in many real-world supervised machine learning (ML) tasks. In this paper, we study a new problem, namely selective data acquisition in the wild for model charging: given a supervised ML task and data in the wild (e.g., enterprise data warehouses, online data repositories, data markets, and so on), the problem is to select labeled data points from the data in the wild as additional train data that can help the ML task. It consists of two steps (Fig. 1). The first step is to discover relevant datasets (e.g., tables with similar relational schema), which will result in a set of candidate datasets. Because these candidate datasets come from different sources and may follow different distributions, not all data points they contain can help. The second step is to select which data points from these candidate datasets should be used. We build an end-to-end solution. For step 1, we piggyback off-the-shelf data discovery tools. Technically, our focus is on step 2, for which we propose a solution framework called AutoData. It first clusters all data points from candidate datasets such that each cluster contains similar data points from different sources. It then iteratively picks which cluster to use, samples data points (i.e., a mini-batch) from the picked cluster, evaluates the mini-batch, and then revises the search criteria by learning from the feedback (i.e., reward) based on the evaluation. We propose a multi-armed bandit based solution and a Deep Q Networks-based reinforcement learning solution. Experiments using both relational and image datasets show the effectiveness of our solutions.

UR - http://www.scopus.com/inward/record.url?scp=85136421272&partnerID=8YFLogxK

U2 - 10.14778/3523210.3523223

DO - 10.14778/3523210.3523223

M3 - Conference article

AN - SCOPUS:85136421272

SN - 2150-8097

VL - 15

SP - 1466

EP - 1478

JO - Proceedings of the VLDB Endowment

JF - Proceedings of the VLDB Endowment

IS - 7

T2 - 48th International Conference on Very Large Data Bases, VLDB 2022

Y2 - 5 September 2022 through 9 September 2022

ER -
