Coresets over Multiple Tables for Feature-rich and Data-efficient Machine Learning

Jiayi Wang, Chengliang Chai, Nan Tang, Jiabin Liu, Guoliang Li

Research output: Contribution to journal › Conference article › peer-review

12 Citations (Scopus)

Abstract

Successful machine learning (ML) needs to learn from good data. However, one common issue with training data for ML practitioners is the lack of good features. To mitigate this problem, feature augmentation is often employed by joining with (or enriching features from) multiple tables, yielding feature-rich ML. A consequent problem is that the enriched training data may contain too many tuples, especially if the feature augmentation is obtained through one (or many)-to-many or fuzzy joins. Training an ML model with a very large training dataset is data-inefficient. Coreset selection is often used to achieve data-efficient ML training: it selects a small subset of the training data that can theoretically and practically perform similarly to the full dataset. However, coreset selection over a large training dataset is also known to be time-consuming. In this paper, we aim at achieving both feature-rich ML through feature augmentation and data-efficient ML through coreset selection. To avoid time-consuming coreset selection over a feature-augmented (or fully materialized) table, we propose to efficiently select the coreset without materializing the augmented table. Note that coreset selection typically uses weighted gradients of the subset to approximate the full gradient of the entire training dataset. Our key idea is that the gradient computation for coreset selection of the augmented table can be pushed down to partial feature similarity of tuples within each individual table, without join materialization. These partial feature similarity values can be aggregated to estimate the gradient of the augmented table, which is upper bounded with provable theoretical guarantees. Extensive experiments show that our method can improve efficiency by nearly 2 orders of magnitude, while keeping almost the same accuracy as training with the fully augmented training data.
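
The push-down idea in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm. It only shows one well-known fact that makes such a push-down plausible: when a joined tuple is the concatenation of per-table feature vectors, the squared Euclidean distance between two joined tuples equals the sum of the per-table squared distances, so pairwise similarity can be computed without building the joined table. The table contents and the names `base`, `extra`, `pushed_down_dist`, and `materialized_dist` are hypothetical.

```python
from itertools import product

def sq_dist(u, v):
    """Squared Euclidean distance between two equal-length feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

# Hypothetical toy tables, keyed by join key.
base = {"k1": [1.0, 2.0], "k2": [4.0, 6.0]}      # one row per key
extra = {"k1": [[0.5], [1.5]], "k2": [[3.0]]}     # one-to-many on the key

# Identify each joined tuple by (key, position in extra[key]);
# the joined feature vectors themselves are never built here.
joined_ids = [(k, j) for k in base for j in range(len(extra[k]))]

def pushed_down_dist(p, q):
    """Distance between two joined tuples from per-table partial distances."""
    (k1, j1), (k2, j2) = p, q
    return sq_dist(base[k1], base[k2]) + sq_dist(extra[k1][j1], extra[k2][j2])

def materialized_dist(p, q):
    """Reference: concatenate features (materialize the join), then compare."""
    (k1, j1), (k2, j2) = p, q
    return sq_dist(base[k1] + extra[k1][j1], base[k2] + extra[k2][j2])

# Check that both routes agree for every pair of joined tuples.
all_pairs_match = all(
    abs(pushed_down_dist(p, q) - materialized_dist(p, q)) < 1e-12
    for p, q in product(joined_ids, repeat=2)
)
```

In this toy setting the per-table distances could be computed once per table and reused across all joined pairs that share the same rows, which is where a saving over full materialization would come from. How the paper turns such partial similarities into a gradient bound is not shown here.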

Original language: English
Pages (from-to): 64-76
Number of pages: 13
Journal: Proceedings of the VLDB Endowment
Volume: 16
Issue number: 1
DOI
Publication status: Published - 2022
Externally published
Event: 49th International Conference on Very Large Data Bases, VLDB 2023 - Vancouver, Canada
Duration: 28 Aug 2023 to 1 Sep 2023
