Data Imputation with Limited Data Redundancy Using Data Lakes

  • Chenyu Yang
  • Yuyu Luo (yuyuluo@hkust-gz.edu.cn)*
  • Chuanxuan Cui
  • Ju Fan
  • Chengliang Chai
  • Nan Tang

*Corresponding author for this work

Research output: Contribution to journal › Conference article › peer-review

Abstract

Data imputation is essential for many data science applications. Existing methods rely heavily on sufficient data redundancy among within-table values. However, many real-world datasets lack such redundancy, necessitating external data sources. In this paper, we introduce a retrieval-augmented imputation framework, LakeFill, which combines large language models (LLMs) and data lakes to address this challenge. Unlike existing "table-level" retrieval methods designed for question answering, which retrieve data at the granularity of whole tables, LakeFill performs fine-grained "tuple-level" retrieval optimized specifically for data imputation. It encodes (possibly incomplete) tuples to capture nuanced similarities and differences, enabling effective identification of candidate tuples. A novel reranking method that integrates checklist-based training data annotation with stratified training group construction further refines the retrieved tuples. Finally, a reasoner with a novel two-stage confidence-aware imputation strategy ensures reliable imputation results. Extensive experiments show that LakeFill significantly outperforms state-of-the-art methods for data imputation when data redundancy is limited.
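The retrieve-rerank-impute pipeline described above can be sketched in miniature. This is an illustrative sketch, not the authors' implementation: the token-overlap similarity stands in for LakeFill's learned tuple encodings, the vote-share heuristic stands in for its two-stage confidence-aware reasoner, and all function names and thresholds are assumptions.

```python
# Hypothetical sketch of tuple-level retrieval-augmented imputation:
# retrieve candidate tuples from a data lake, rank them, and impute a
# missing value only when the candidates agree with enough confidence.
from collections import Counter

def tuple_tokens(t):
    """Tokenize the known (non-missing) values of a tuple."""
    return {w.lower() for v in t.values() if v is not None for w in str(v).split()}

def similarity(query, candidate):
    """Jaccard token overlap -- a stand-in for learned tuple encodings."""
    q, c = tuple_tokens(query), tuple_tokens(candidate)
    return len(q & c) / len(q | c) if q | c else 0.0

def retrieve(query, lake, k=3):
    """Tuple-level retrieval: rank lake tuples by similarity to the query."""
    return sorted(lake, key=lambda t: similarity(query, t), reverse=True)[:k]

def impute(query, missing_attr, lake, threshold=0.5):
    """Confidence-aware imputation: vote among retrieved tuples and return
    a value only if the winning vote share clears the threshold."""
    votes = Counter(t[missing_attr] for t in retrieve(query, lake)
                    if t.get(missing_attr) is not None)
    if not votes:
        return None  # abstain rather than guess
    value, count = votes.most_common(1)[0]
    return value if count / sum(votes.values()) >= threshold else None

# Toy "data lake" of complete tuples and an incomplete query tuple.
lake = [
    {"name": "Eiffel Tower", "city": "Paris", "country": "France"},
    {"name": "Louvre Museum", "city": "Paris", "country": "France"},
    {"name": "Tower Bridge", "city": "London", "country": "UK"},
]
query = {"name": "Eiffel Tower", "city": "Paris", "country": None}
print(impute(query, "country", lake))  # → France
```

The key design point the sketch mirrors is abstention: when retrieved candidates disagree too much, the imputer returns `None` instead of a low-confidence value.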

Original language: English
Pages (from-to): 3354-3367
Number of pages: 14
Journal: Proceedings of the VLDB Endowment
Volume: 18
Issue number: 10
DOIs
Publication status: Published - 2025
Event: 51st International Conference on Very Large Data Bases, VLDB 2025 - London, United Kingdom
Duration: 1 Sept 2025 - 5 Sept 2025
