Coresets over Multiple Tables for Feature-rich and Data-efficient Machine Learning

Jiayi Wang; Chengliang Chai; Nan Tang; Jiabin Liu; Guoliang Li

doi:10.14778/3561261.3561267

Research output: Contribution to journal › Conference article › peer-review

12 Citations (Scopus)

Abstract

Successful machine learning (ML) needs to learn from good data. However, one common issue about train data for ML practitioners is the lack of good features. To mitigate this problem, feature augmentation is often employed by joining with (or enriching features from) multiple tables, so as to become feature-rich ML. A consequent problem is that the enriched train data may contain too many tuples, especially if the feature augmentation is obtained through 1 (or many)-to-many or fuzzy joins. Training an ML model with a very large train dataset is data-inefficient. Coreset is often used to achieve data-efficient ML training, which selects a small subset of train data that can theoretically and practically perform similarly as using the full dataset. However, coreset selection over a large train dataset is also known to be time-consuming. In this paper, we aim at achieving both feature-rich ML through feature augmentation and data-efficient ML through coreset selection. In order to avoid time-consuming coreset selection over a feature augmented (or fully materialized) table, we propose to efficiently select the coreset without materializing the augmented table. Note that coreset selection typically uses weighted gradients of the subset to approximate the full gradient of the entire train dataset. Our key idea is that the gradient computation for coreset selection of the augmented table can be pushed down to partial feature similarity of tuples within each individual table, without join materialization. These partial feature similarity values can be aggregated to estimate the gradient of the augmented table, which is upper bounded with provable theoretical guarantees. Extensive experiments show that our method can improve the efficiency by nearly 2 orders of magnitude, while keeping almost the same accuracy as training with the fully augmented train data.
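As a rough illustration of the "partial feature similarity" idea in the abstract, the sketch below shows one setting in which similarity between joined tuples can be assembled from per-table values alone. It is not the authors' algorithm. It assumes squared Euclidean distance as the similarity measure (the abstract does not name one), and the names `pairwise_sq_dists`, `joined_sq_dists`, `base`, `aug`, and `pairs` are hypothetical. The point is only that, when a joined tuple is the concatenation of a base-table row and an augmenting-table row, the squared distance between two joined tuples is the sum of the two per-table squared distances.

```python
import numpy as np

def pairwise_sq_dists(x):
    """Squared Euclidean distances between all rows of x."""
    diff = x[:, None, :] - x[None, :, :]
    return (diff ** 2).sum(axis=-1)

def joined_sq_dists(base, aug, pairs):
    """Distances between joined tuples, built from per-table distances only.

    pairs[r] = (i, j) means joined row r is base[i] concatenated with aug[j].
    The joined feature matrix itself is never constructed here.
    """
    d_base = pairwise_sq_dists(base)  # partial distances within the base table
    d_aug = pairwise_sq_dists(aug)    # partial distances within the augmenting table
    bi = pairs[:, 0]
    aj = pairs[:, 1]
    # Aggregate the partial values for every pair of joined rows.
    return d_base[np.ix_(bi, bi)] + d_aug[np.ix_(aj, aj)]

# Toy data (assumed): 3 base rows, 4 augmenting rows, a 1-to-many join of 5 rows.
rng = np.random.default_rng(0)
base = rng.normal(size=(3, 2))
aug = rng.normal(size=(4, 5))
pairs = np.array([[0, 0], [0, 1], [1, 2], [2, 2], [2, 3]])

partial = joined_sq_dists(base, aug, pairs)

# Reference computation on the materialized join, for comparison only.
materialized = np.hstack([base[pairs[:, 0]], aug[pairs[:, 1]]])
full = pairwise_sq_dists(materialized)
```

In this toy setting `partial` and `full` agree up to floating-point error. How such distances are turned into gradient bounds and coreset weights is described in the paper itself and is not reproduced here.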

Original language	English
Pages (from-to)	64-76
Number of pages	13
Journal	Proceedings of the VLDB Endowment
Volume	16
Issue number	1
DOIs	https://doi.org/10.14778/3561261.3561267
Publication status	Published - 2022
Externally published	Yes
Event	49th International Conference on Very Large Data Bases, VLDB 2023 - Vancouver, Canada
Duration	28 Aug 2023 → 1 Sept 2023

Cite this

Wang, J., Chai, C., Tang, N., Liu, J., & Li, G. (2022). Coresets over Multiple Tables for Feature-rich and Data-efficient Machine Learning. Proceedings of the VLDB Endowment, 16(1), 64-76. https://doi.org/10.14778/3561261.3561267

@article{044094add06e4aa09802f0af5f549af6,

title = "Coresets over Multiple Tables for Feature-rich and Data-efficient Machine Learning",

abstract = "Successful machine learning (ML) needs to learn from good data. However, one common issue about train data for ML practitioners is the lack of good features. To mitigate this problem, feature augmentation is often employed by joining with (or enriching features from) multiple tables, so as to become feature-rich ML. A consequent problem is that the enriched train data may contain too many tuples, especially if the feature augmentation is obtained through 1 (or many)-to-many or fuzzy joins. Training an ML model with a very large train dataset is data-inefficient. Coreset is often used to achieve data-efficient ML training, which selects a small subset of train data that can theoretically and practically perform similarly as using the full dataset. However, coreset selection over a large train dataset is also known to be time-consuming. In this paper, we aim at achieving both feature-rich ML through feature augmentation and data-efficient ML through coreset selection. In order to avoid time-consuming coreset selection over a feature augmented (or fully materialized) table, we propose to efficiently select the coreset without materializing the augmented table. Note that coreset selection typically uses weighted gradients of the subset to approximate the full gradient of the entire train dataset. Our key idea is that the gradient computation for coreset selection of the augmented table can be pushed down to partial feature similarity of tuples within each individual table, without join materialization. These partial feature similarity values can be aggregated to estimate the gradient of the augmented table, which is upper bounded with provable theoretical guarantees. Extensive experiments show that our method can improve the efficiency by nearly 2 orders of magnitude, while keeping almost the same accuracy as training with the fully augmented train data.",

author = "Jiayi Wang and Chengliang Chai and Nan Tang and Jiabin Liu and Guoliang Li",

note = "Publisher Copyright: {\textcopyright} 2022 VLDB Endowment.; 49th International Conference on Very Large Data Bases, VLDB 2023 ; Conference date: 28-08-2023 Through 01-09-2023",

year = "2022",

doi = "10.14778/3561261.3561267",

language = "English",

volume = "16",

pages = "64--76",

journal = "Proceedings of the VLDB Endowment",

issn = "2150-8097",

publisher = "Very Large Data Base Endowment Inc.",

number = "1",

}

TY - JOUR

T1 - Coresets over Multiple Tables for Feature-rich and Data-efficient Machine Learning

AU - Wang, Jiayi

AU - Chai, Chengliang

AU - Tang, Nan

AU - Liu, Jiabin

AU - Li, Guoliang

PY - 2022

Y1 - 2022

N2 - Successful machine learning (ML) needs to learn from good data. However, one common issue about train data for ML practitioners is the lack of good features. To mitigate this problem, feature augmentation is often employed by joining with (or enriching features from) multiple tables, so as to become feature-rich ML. A consequent problem is that the enriched train data may contain too many tuples, especially if the feature augmentation is obtained through 1 (or many)-to-many or fuzzy joins. Training an ML model with a very large train dataset is data-inefficient. Coreset is often used to achieve data-efficient ML training, which selects a small subset of train data that can theoretically and practically perform similarly as using the full dataset. However, coreset selection over a large train dataset is also known to be time-consuming. In this paper, we aim at achieving both feature-rich ML through feature augmentation and data-efficient ML through coreset selection. In order to avoid time-consuming coreset selection over a feature augmented (or fully materialized) table, we propose to efficiently select the coreset without materializing the augmented table. Note that coreset selection typically uses weighted gradients of the subset to approximate the full gradient of the entire train dataset. Our key idea is that the gradient computation for coreset selection of the augmented table can be pushed down to partial feature similarity of tuples within each individual table, without join materialization. These partial feature similarity values can be aggregated to estimate the gradient of the augmented table, which is upper bounded with provable theoretical guarantees. Extensive experiments show that our method can improve the efficiency by nearly 2 orders of magnitude, while keeping almost the same accuracy as training with the fully augmented train data.

AB - Successful machine learning (ML) needs to learn from good data. However, one common issue about train data for ML practitioners is the lack of good features. To mitigate this problem, feature augmentation is often employed by joining with (or enriching features from) multiple tables, so as to become feature-rich ML. A consequent problem is that the enriched train data may contain too many tuples, especially if the feature augmentation is obtained through 1 (or many)-to-many or fuzzy joins. Training an ML model with a very large train dataset is data-inefficient. Coreset is often used to achieve data-efficient ML training, which selects a small subset of train data that can theoretically and practically perform similarly as using the full dataset. However, coreset selection over a large train dataset is also known to be time-consuming. In this paper, we aim at achieving both feature-rich ML through feature augmentation and data-efficient ML through coreset selection. In order to avoid time-consuming coreset selection over a feature augmented (or fully materialized) table, we propose to efficiently select the coreset without materializing the augmented table. Note that coreset selection typically uses weighted gradients of the subset to approximate the full gradient of the entire train dataset. Our key idea is that the gradient computation for coreset selection of the augmented table can be pushed down to partial feature similarity of tuples within each individual table, without join materialization. These partial feature similarity values can be aggregated to estimate the gradient of the augmented table, which is upper bounded with provable theoretical guarantees. Extensive experiments show that our method can improve the efficiency by nearly 2 orders of magnitude, while keeping almost the same accuracy as training with the fully augmented train data.

UR - http://www.scopus.com/inward/record.url?scp=85140375247&partnerID=8YFLogxK

U2 - 10.14778/3561261.3561267

DO - 10.14778/3561261.3561267

M3 - Conference article

AN - SCOPUS:85140375247

SN - 2150-8097

VL - 16

SP - 64

EP - 76

JO - Proceedings of the VLDB Endowment

JF - Proceedings of the VLDB Endowment

IS - 1

T2 - 49th International Conference on Very Large Data Bases, VLDB 2023

Y2 - 28 August 2023 through 1 September 2023

ER -
