LakeBench: A Benchmark for Discovering Joinable and Unionable Tables in Data Lakes

Yuhao Deng, Chengliang Chai, Lei Cao, Qin Yuan, Siyuan Chen, Yanrui Yu, Zhaoze Sun, Junyi Wang, Jiajun Li, Ziqi Cao, Kaisen Jin, Chi Zhang, Yuqing Jiang, Yuanfang Zhang, Yuping Wang, Ye Yuan, Guoren Wang, Nan Tang

Research output: Contribution to journal, conference article, peer-reviewed

1 citation (Scopus)

Abstract

Discovering tables from poorly maintained data lakes is a significant challenge in data management. Two key tasks are identifying joinable and unionable tables, which are crucial for data integration, analysis, and machine learning. However, there is a lack of a comprehensive benchmark for evaluating existing methods. To address this, we introduce LakeBench, a large-scale table discovery benchmark. It evaluates the effectiveness, efficiency, and scalability of table join and union search methods. With over 16 million real tables, LakeBench is 1,600X larger than existing datasets and 100X larger in storage size. It includes synthesized and real queries with ground truth, totaling more than 10 thousand queries, 10X more than used in any existing evaluation. We spent over 7,500 human hours labeling these queries and constructing diverse query categories for thorough evaluation. Our benchmark thoroughly evaluates state-of-the-art table discovery methods, providing insights into their performance and highlighting research opportunities.

Original language: English
Pages (from-to): 1925-1938
Number of pages: 14
Journal: Proceedings of the VLDB Endowment
Volume: 17
Issue number: 8
DOI
Publication status: Published - 2024
Event: 50th International Conference on Very Large Data Bases, VLDB 2024 - Guangzhou, China
Duration: 24 Aug 2024 to 29 Aug 2024
