TY - JOUR
T1 - Coresets over Multiple Tables for Feature-rich and Data-efficient Machine Learning
AU - Wang, Jiayi
AU - Chai, Chengliang
AU - Tang, Nan
AU - Liu, Jiabin
AU - Li, Guoliang
N1 - Publisher Copyright:
© 2022 VLDB Endowment.
PY - 2022
Y1 - 2022
N2 - Successful machine learning (ML) needs to learn from good data. However, one common issue for ML practitioners is that the training data lacks good features. To mitigate this problem, feature augmentation is often employed by joining with (or enriching features from) multiple tables, so as to achieve feature-rich ML. A consequent problem is that the enriched training data may contain too many tuples, especially when the feature augmentation is obtained through one (or many)-to-many or fuzzy joins. Training an ML model on a very large training dataset is data-inefficient. Coresets are often used to achieve data-efficient ML training: a coreset is a small subset of the training data that, both theoretically and in practice, performs similarly to training on the full dataset. However, coreset selection over a large training dataset is itself known to be time-consuming. In this paper, we aim at achieving both feature-rich ML through feature augmentation and data-efficient ML through coreset selection. To avoid time-consuming coreset selection over a feature-augmented (i.e., fully materialized) table, we propose to efficiently select the coreset without materializing the augmented table. Note that coreset selection typically uses weighted gradients of the subset to approximate the full gradient of the entire training dataset. Our key idea is that the gradient computation for coreset selection over the augmented table can be pushed down to partial feature similarities of tuples within each individual table, without join materialization. These partial feature similarity values can be aggregated to estimate the gradient of the augmented table, which is upper bounded with provable theoretical guarantees. Extensive experiments show that our method improves efficiency by nearly two orders of magnitude while keeping almost the same accuracy as training with the fully augmented training data.
AB - Successful machine learning (ML) needs to learn from good data. However, one common issue for ML practitioners is that the training data lacks good features. To mitigate this problem, feature augmentation is often employed by joining with (or enriching features from) multiple tables, so as to achieve feature-rich ML. A consequent problem is that the enriched training data may contain too many tuples, especially when the feature augmentation is obtained through one (or many)-to-many or fuzzy joins. Training an ML model on a very large training dataset is data-inefficient. Coresets are often used to achieve data-efficient ML training: a coreset is a small subset of the training data that, both theoretically and in practice, performs similarly to training on the full dataset. However, coreset selection over a large training dataset is itself known to be time-consuming. In this paper, we aim at achieving both feature-rich ML through feature augmentation and data-efficient ML through coreset selection. To avoid time-consuming coreset selection over a feature-augmented (i.e., fully materialized) table, we propose to efficiently select the coreset without materializing the augmented table. Note that coreset selection typically uses weighted gradients of the subset to approximate the full gradient of the entire training dataset. Our key idea is that the gradient computation for coreset selection over the augmented table can be pushed down to partial feature similarities of tuples within each individual table, without join materialization. These partial feature similarity values can be aggregated to estimate the gradient of the augmented table, which is upper bounded with provable theoretical guarantees. Extensive experiments show that our method improves efficiency by nearly two orders of magnitude while keeping almost the same accuracy as training with the fully augmented training data.
UR - http://www.scopus.com/inward/record.url?scp=85140375247&partnerID=8YFLogxK
U2 - 10.14778/3561261.3561267
DO - 10.14778/3561261.3561267
M3 - Conference article
AN - SCOPUS:85140375247
SN - 2150-8097
VL - 16
SP - 64
EP - 76
JO - Proceedings of the VLDB Endowment
JF - Proceedings of the VLDB Endowment
IS - 1
T2 - 49th International Conference on Very Large Data Bases, VLDB 2023
Y2 - 28 August 2023 through 1 September 2023
ER -