TY - JOUR
T1 - Data Imputation with Limited Data Redundancy Using Data Lakes
AU - Yang, Chenyu
AU - Luo, Yuyu
AU - Cui, Chuanxuan
AU - Fan, Ju
AU - Chai, Chengliang
AU - Tang, Nan
N1 - Publisher Copyright: © 2025, VLDB Endowment. All rights reserved.
PY - 2025
Y1 - 2025
AB - Data imputation is essential for many data science applications. Existing methods rely heavily on sufficient data redundancy from within-table values. However, many real-world datasets often lack such data redundancy, necessitating external data sources. In this paper, we introduce a retrieval-augmented imputation framework, LakeFill, which combines large language models (LLMs) and data lakes to address this challenge. Unlike existing “table-level” retrieval methods designed for question answering, which retrieve data in the granularity of tables, LakeFill performs fine-grained “tuple-level” retrieval, optimized specifically for data imputation at the tuple level. It encodes (possibly incomplete) tuples to capture nuanced similarities and differences, enabling effective identification of candidate tuples. A novel reranking method that integrates checklist-based training data annotation with stratified training group construction further refines the retrieved tuples. Finally, a reasoner with a novel two-stage confidence-aware imputation ensures reliable imputation results. Extensive experiments show that LakeFill significantly outperforms state-of-the-art methods for data imputation when there is limited data redundancy.
UR - https://www.scopus.com/pages/publications/105021395992
U2 - 10.14778/3748191.3748200
DO - 10.14778/3748191.3748200
M3 - Conference article
AN - SCOPUS:105021395992
SN - 2150-8097
VL - 18
SP - 3354
EP - 3367
JO - Proceedings of the VLDB Endowment
JF - Proceedings of the VLDB Endowment
IS - 10
T2 - 51st International Conference on Very Large Data Bases, VLDB 2025
Y2 - 1 September 2025 through 5 September 2025
ER -