LakeBench: A Benchmark for Discovering Joinable and Unionable Tables in Data Lakes

Yuhao Deng; Chengliang Chai; Lei Cao; Qin Yuan; Siyuan Chen; Yanrui Yu; Zhaoze Sun; Junyi Wang; Jiajun Li; Ziqi Cao; Kaisen Jin; Chi Zhang; Yuqing Jiang; Yuanfang Zhang; Yuping Wang; Ye Yuan; Guoren Wang; Nan Tang

doi:10.14778/3659437.3659448

School of Computer Science and Technology

Research output: Contribution to journal › Conference article › peer-review

1 Citation (Scopus)

Abstract

Discovering tables from poorly maintained data lakes is a significant challenge in data management. Two key tasks are identifying joinable and unionable tables, crucial for data integration, analysis, and machine learning. However, there’s a lack of a comprehensive benchmark for evaluating existing methods. To address this, we introduce LakeBench, a large-scale table discovery benchmark. It evaluates effectiveness, efficiency, and scalability of table join & union search methods. With over 16 million real tables, LakeBench is 1,600X larger than existing datasets and 100X larger in storage size. It includes synthesized and real queries with ground truth, totaling more than 10 thousand queries, 10X more than used in any existing evaluation. We spent over 7,500 human hours labeling these queries and constructing diverse query categories for thorough evaluation. Our benchmark thoroughly evaluates state-of-the-art table discovery methods, providing insights into their performance and highlighting research opportunities.

Original language	English
Pages (from-to)	1925-1938
Number of pages	14
Journal	Proceedings of the VLDB Endowment
Volume	17
Issue number	8
DOIs	https://doi.org/10.14778/3659437.3659448
Publication status	Published - 2024
Event	50th International Conference on Very Large Data Bases, VLDB 2024 - Guangzhou, China Duration: 24 Aug 2024 → 29 Aug 2024

Access to Document

10.14778/3659437.3659448

Cite this

Deng, Y., Chai, C., Cao, L., Yuan, Q., Chen, S., Yu, Y., Sun, Z., Wang, J., Li, J., Cao, Z., Jin, K., Zhang, C., Jiang, Y., Zhang, Y., Wang, Y., Yuan, Y., Wang, G., & Tang, N. (2024). LakeBench: A Benchmark for Discovering Joinable and Unionable Tables in Data Lakes. Proceedings of the VLDB Endowment, 17(8), 1925-1938. https://doi.org/10.14778/3659437.3659448

@article{4e774a04dcd14c559aa734bd32ee1ba3,

title = "LakeBench: A Benchmark for Discovering Joinable and Unionable Tables in Data Lakes",

abstract = "Discovering tables from poorly maintained data lakes is a significant challenge in data management. Two key tasks are identifying joinable and unionable tables, crucial for data integration, analysis, and machine learning. However, there{\textquoteright}s a lack of a comprehensive benchmark for evaluating existing methods. To address this, we introduce LakeBench, a large-scale table discovery benchmark. It evaluates effectiveness, efficiency, and scalability of table join \& union search methods. With over 16 million real tables, LakeBench is 1,600X larger than existing datasets and 100X larger in storage size. It includes synthesized and real queries with ground truth, totaling more than 10 thousand queries, 10X more than used in any existing evaluation. We spent over 7,500 human hours labeling these queries and constructing diverse query categories for thorough evaluation. Our benchmark thoroughly evaluates state-of-the-art table discovery methods, providing insights into their performance and highlighting research opportunities.",

author = "Yuhao Deng and Chengliang Chai and Lei Cao and Qin Yuan and Siyuan Chen and Yanrui Yu and Zhaoze Sun and Junyi Wang and Jiajun Li and Ziqi Cao and Kaisen Jin and Chi Zhang and Yuqing Jiang and Yuanfang Zhang and Yuping Wang and Ye Yuan and Guoren Wang and Nan Tang",

note = "Publisher Copyright: {\textcopyright} 2024, VLDB Endowment. All rights reserved.; 50th International Conference on Very Large Data Bases, VLDB 2024 ; Conference date: 24-08-2024 Through 29-08-2024",

year = "2024",

doi = "10.14778/3659437.3659448",

language = "English",

volume = "17",

pages = "1925--1938",

journal = "Proceedings of the VLDB Endowment",

issn = "2150-8097",

publisher = "Very Large Data Base Endowment Inc.",

number = "8",

}

TY - JOUR

T1 - LakeBench: A Benchmark for Discovering Joinable and Unionable Tables in Data Lakes

T2 - 50th International Conference on Very Large Data Bases, VLDB 2024

AU - Deng, Yuhao

AU - Chai, Chengliang

AU - Cao, Lei

AU - Yuan, Qin

AU - Chen, Siyuan

AU - Yu, Yanrui

AU - Sun, Zhaoze

AU - Wang, Junyi

AU - Li, Jiajun

AU - Cao, Ziqi

AU - Jin, Kaisen

AU - Zhang, Chi

AU - Jiang, Yuqing

AU - Zhang, Yuanfang

AU - Wang, Yuping

AU - Yuan, Ye

AU - Wang, Guoren

AU - Tang, Nan

PY - 2024

Y1 - 2024

N2 - Discovering tables from poorly maintained data lakes is a significant challenge in data management. Two key tasks are identifying joinable and unionable tables, crucial for data integration, analysis, and machine learning. However, there’s a lack of a comprehensive benchmark for evaluating existing methods. To address this, we introduce LakeBench, a large-scale table discovery benchmark. It evaluates effectiveness, efficiency, and scalability of table join & union search methods. With over 16 million real tables, LakeBench is 1,600X larger than existing datasets and 100X larger in storage size. It includes synthesized and real queries with ground truth, totaling more than 10 thousand queries, 10X more than used in any existing evaluation. We spent over 7,500 human hours labeling these queries and constructing diverse query categories for thorough evaluation. Our benchmark thoroughly evaluates state-of-the-art table discovery methods, providing insights into their performance and highlighting research opportunities.

AB - Discovering tables from poorly maintained data lakes is a significant challenge in data management. Two key tasks are identifying joinable and unionable tables, crucial for data integration, analysis, and machine learning. However, there’s a lack of a comprehensive benchmark for evaluating existing methods. To address this, we introduce LakeBench, a large-scale table discovery benchmark. It evaluates effectiveness, efficiency, and scalability of table join & union search methods. With over 16 million real tables, LakeBench is 1,600X larger than existing datasets and 100X larger in storage size. It includes synthesized and real queries with ground truth, totaling more than 10 thousand queries, 10X more than used in any existing evaluation. We spent over 7,500 human hours labeling these queries and constructing diverse query categories for thorough evaluation. Our benchmark thoroughly evaluates state-of-the-art table discovery methods, providing insights into their performance and highlighting research opportunities.

UR - http://www.scopus.com/inward/record.url?scp=85195655575&partnerID=8YFLogxK

U2 - 10.14778/3659437.3659448

DO - 10.14778/3659437.3659448

M3 - Conference article

AN - SCOPUS:85195655575

SN - 2150-8097

VL - 17

SP - 1925

EP - 1938

JO - Proceedings of the VLDB Endowment

JF - Proceedings of the VLDB Endowment

IS - 8

Y2 - 24 August 2024 through 29 August 2024

ER -
