TY - JOUR
T1 - Coresets over Multiple Tables for Feature-rich and Data-efficient Machine Learning
AU - Wang, Jiayi
AU - Chai, Chengliang
AU - Tang, Nan
AU - Liu, Jiabin
AU - Li, Guoliang
N1 - Publisher Copyright:
© 2022 VLDB Endowment.
PY - 2022
Y1 - 2022
N2 - Successful machine learning (ML) needs to learn from good data. However, one common issue for ML practitioners is that the training data lacks good features. To mitigate this problem, feature augmentation is often employed by joining with (or enriching features from) multiple tables, so as to achieve feature-rich ML. A consequent problem is that the enriched training data may contain too many tuples, especially when the feature augmentation is obtained through one (or many)-to-many or fuzzy joins. Training an ML model on a very large training dataset is data-inefficient. Coresets are often used to achieve data-efficient ML training: a coreset is a small subset of the training data that, both theoretically and in practice, performs similarly to training on the full dataset. However, coreset selection over a large training dataset is itself known to be time-consuming. In this paper, we aim at achieving both feature-rich ML through feature augmentation and data-efficient ML through coreset selection. To avoid time-consuming coreset selection over a feature-augmented (i.e., fully materialized) table, we propose to efficiently select the coreset without materializing the augmented table. Note that coreset selection typically uses weighted gradients of the subset to approximate the full gradient of the entire training dataset. Our key idea is that the gradient computation for coreset selection over the augmented table can be pushed down to partial feature similarities of tuples within each individual table, without join materialization. These partial feature similarity values can be aggregated to estimate the gradient of the augmented table, which is upper bounded with provable theoretical guarantees. Extensive experiments show that our method improves efficiency by nearly two orders of magnitude while keeping almost the same accuracy as training with the fully augmented training data.
AB - Successful machine learning (ML) needs to learn from good data. However, one common issue for ML practitioners is that the training data lacks good features. To mitigate this problem, feature augmentation is often employed by joining with (or enriching features from) multiple tables, so as to achieve feature-rich ML. A consequent problem is that the enriched training data may contain too many tuples, especially when the feature augmentation is obtained through one (or many)-to-many or fuzzy joins. Training an ML model on a very large training dataset is data-inefficient. Coresets are often used to achieve data-efficient ML training: a coreset is a small subset of the training data that, both theoretically and in practice, performs similarly to training on the full dataset. However, coreset selection over a large training dataset is itself known to be time-consuming. In this paper, we aim at achieving both feature-rich ML through feature augmentation and data-efficient ML through coreset selection. To avoid time-consuming coreset selection over a feature-augmented (i.e., fully materialized) table, we propose to efficiently select the coreset without materializing the augmented table. Note that coreset selection typically uses weighted gradients of the subset to approximate the full gradient of the entire training dataset. Our key idea is that the gradient computation for coreset selection over the augmented table can be pushed down to partial feature similarities of tuples within each individual table, without join materialization. These partial feature similarity values can be aggregated to estimate the gradient of the augmented table, which is upper bounded with provable theoretical guarantees. Extensive experiments show that our method improves efficiency by nearly two orders of magnitude while keeping almost the same accuracy as training with the fully augmented training data.
UR - http://www.scopus.com/inward/record.url?scp=85140375247&partnerID=8YFLogxK
U2 - 10.14778/3561261.3561267
DO - 10.14778/3561261.3561267
M3 - Conference article
AN - SCOPUS:85140375247
SN - 2150-8097
VL - 16
SP - 64
EP - 76
JO - Proceedings of the VLDB Endowment
JF - Proceedings of the VLDB Endowment
IS - 1
T2 - 49th International Conference on Very Large Data Bases, VLDB 2023
Y2 - 28 August 2023 through 1 September 2023
ER -