TY - JOUR
T1 - LakeBench
T2 - 50th International Conference on Very Large Data Bases, VLDB 2024
AU - Deng, Yuhao
AU - Chai, Chengliang
AU - Cao, Lei
AU - Yuan, Qin
AU - Chen, Siyuan
AU - Yu, Yanrui
AU - Sun, Zhaoze
AU - Wang, Junyi
AU - Li, Jiajun
AU - Cao, Ziqi
AU - Jin, Kaisen
AU - Zhang, Chi
AU - Jiang, Yuqing
AU - Zhang, Yuanfang
AU - Wang, Yuping
AU - Yuan, Ye
AU - Wang, Guoren
AU - Tang, Nan
N1 - Publisher Copyright: © 2024, VLDB Endowment. All rights reserved.
PY - 2024
Y1 - 2024
N2 - Discovering tables from poorly maintained data lakes is a significant challenge in data management. Two key tasks are identifying joinable and unionable tables, crucial for data integration, analysis, and machine learning. However, there’s a lack of a comprehensive benchmark for evaluating existing methods. To address this, we introduce LakeBench, a large-scale table discovery benchmark. It evaluates effectiveness, efficiency, and scalability of table join & union search methods. With over 16 million real tables, LakeBench is 1,600X larger than existing datasets and 100X larger in storage size. It includes synthesized and real queries with ground truth, totaling more than 10 thousand queries – 10X more than used in any existing evaluation. We spent over 7,500 human hours labeling these queries and constructing diverse query categories for thorough evaluation. Our benchmark thoroughly evaluates state-of-the-art table discovery methods, providing insights into their performance and highlighting research opportunities.
AB - Discovering tables from poorly maintained data lakes is a significant challenge in data management. Two key tasks are identifying joinable and unionable tables, crucial for data integration, analysis, and machine learning. However, there’s a lack of a comprehensive benchmark for evaluating existing methods. To address this, we introduce LakeBench, a large-scale table discovery benchmark. It evaluates effectiveness, efficiency, and scalability of table join & union search methods. With over 16 million real tables, LakeBench is 1,600X larger than existing datasets and 100X larger in storage size. It includes synthesized and real queries with ground truth, totaling more than 10 thousand queries – 10X more than used in any existing evaluation. We spent over 7,500 human hours labeling these queries and constructing diverse query categories for thorough evaluation. Our benchmark thoroughly evaluates state-of-the-art table discovery methods, providing insights into their performance and highlighting research opportunities.
UR - http://www.scopus.com/inward/record.url?scp=85195655575&partnerID=8YFLogxK
U2 - 10.14778/3659437.3659448
DO - 10.14778/3659437.3659448
M3 - Conference article
AN - SCOPUS:85195655575
SN - 2150-8097
VL - 17
SP - 1925
EP - 1938
JO - Proceedings of the VLDB Endowment
JF - Proceedings of the VLDB Endowment
IS - 8
Y2 - 24 August 2024 through 29 August 2024
ER -