Coresets over Multiple Tables for Feature-rich and Data-efficient Machine Learning

Jiayi Wang, Chengliang Chai, Nan Tang, Jiabin Liu, Guoliang Li

Research output: Contribution to journal › Conference article › peer-review

12 Citations (Scopus)

Abstract

Successful machine learning (ML) needs to learn from good data. However, a common issue with training data for ML practitioners is the lack of good features. To mitigate this problem, feature augmentation is often employed by joining with (or enriching features from) multiple tables, yielding feature-rich ML. A consequent problem is that the enriched training data may contain too many tuples, especially if the feature augmentation is obtained through one-to-many, many-to-many, or fuzzy joins. Training an ML model on a very large training dataset is data-inefficient. Coresets are often used to achieve data-efficient ML training: a coreset is a small subset of the training data that, in theory and in practice, performs similarly to the full dataset. However, coreset selection over a large training dataset is also known to be time-consuming. In this paper, we aim to achieve both feature-rich ML through feature augmentation and data-efficient ML through coreset selection. To avoid time-consuming coreset selection over a feature-augmented (or fully materialized) table, we propose to efficiently select the coreset without materializing the augmented table. Note that coreset selection typically uses weighted gradients of the subset to approximate the full gradient of the entire training dataset. Our key idea is that the gradient computation for coreset selection over the augmented table can be pushed down to partial feature similarity of tuples within each individual table, without join materialization. These partial feature similarity values can be aggregated to estimate the gradient of the augmented table, which is upper bounded with provable theoretical guarantees. Extensive experiments show that our method can improve efficiency by nearly two orders of magnitude, while keeping almost the same accuracy as training with the fully augmented training data.
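The push-down idea in the abstract can be illustrated with a small sketch. It is not the authors' algorithm and does not reproduce their bounds. It makes three simplifying assumptions: the augmented feature vector is the concatenation of each table's features; the join is a plain many-to-one foreign-key join; and feature distance stands in as a proxy for gradient distance. Under those assumptions, the squared distance between two joined tuples is the sum of their per-table squared distances, so the joined table never has to be built. The function names (`pairwise_sq_dists`, `joined_sq_dists`, `greedy_coreset`) are hypothetical, and the selection step is a generic greedy facility-location heuristic.

```python
import numpy as np

def pairwise_sq_dists(x):
    """Squared Euclidean distance between every pair of rows of x."""
    sq = np.sum(x * x, axis=1)
    d = sq[:, None] + sq[None, :] - 2.0 * (x @ x.T)
    return np.maximum(d, 0.0)  # clamp tiny negatives caused by rounding

def joined_sq_dists(base_feats, dim_feats, fk):
    """Distances between joined tuples without materializing the join.

    base_feats: (n, p) features stored in the base table
    dim_feats:  (m, q) features stored in the table being joined
    fk:         (n,) row index into dim_feats for each base row
                (assumption: many-to-one foreign-key join)
    """
    d_base = pairwise_sq_dists(base_feats)   # n x n, base table only
    d_dim = pairwise_sq_dists(dim_feats)     # m x m, joined table only
    # Concatenated features => squared distances add up per table.
    return d_base + d_dim[np.ix_(fk, fk)]

def greedy_coreset(sq_dists, k):
    """Pick k representatives by greedy facility location.

    Returns the selected row indices and, for each one, a weight equal
    to the number of tuples whose most similar representative it is.
    """
    sim = sq_dists.max() - sq_dists          # larger means more similar
    n = sim.shape[0]
    best = np.zeros(n)                       # best similarity covered so far
    selected = []
    for _ in range(k):
        # Total coverage if candidate j were added, minus current coverage.
        gains = np.maximum(sim, best[:, None]).sum(axis=0) - best.sum()
        if selected:
            gains[selected] = -np.inf        # never pick the same row twice
        j = int(np.argmax(gains))
        selected.append(j)
        best = np.maximum(best, sim[:, j])
    assign = np.argmax(sim[:, selected], axis=1)
    weights = np.bincount(assign, minlength=k)
    return selected, weights

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    base = rng.normal(size=(6, 3))           # toy base table
    dim = rng.normal(size=(4, 2))            # toy table to join
    fk = np.array([0, 1, 1, 2, 3, 0])        # toy foreign keys
    d = joined_sq_dists(base, dim, fk)
    print(greedy_coreset(d, k=2))
```

The weights sum to the number of tuples, so a weighted loss over the selected rows is scaled like a loss over the full joined table. Fuzzy or many-to-many joins would need a different index mapping than the single `fk` array assumed here.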

Original language: English
Pages (from-to): 64-76
Number of pages: 13
Journal: Proceedings of the VLDB Endowment
Volume: 16
Issue number: 1
DOIs: https://doi.org/10.14778/3561261.3561267
Publication status: Published - 2022
Externally published: Yes
Event: 49th International Conference on Very Large Data Bases, VLDB 2023 - Vancouver, Canada
Duration: 28 Aug 2023 to 1 Sept 2023

Cite this

Wang, J., Chai, C., Tang, N., Liu, J., & Li, G. (2022). Coresets over Multiple Tables for Feature-rich and Data-efficient Machine Learning. Proceedings of the VLDB Endowment, 16(1), 64-76. https://doi.org/10.14778/3561261.3561267