TY - JOUR
T1 - Privacy-Enhanced Database Synthesis for Benchmark Publishing
AU - Ge, Yunqing
AU - Qin, Jianbin
AU - Zheng, Shuyuan
AU - Zhong, Yongrui
AU - Tang, Bo
AU - Qiu, Yu Xuan
AU - Mao, Rui
AU - Yuan, Ye
AU - Onizuka, Makoto
AU - Xiao, Chuan
N1 - Publisher Copyright: © 2025, VLDB Endowment. All rights reserved.
PY - 2025
Y1 - 2025
AB - Benchmarking is crucial for evaluating a DBMS, yet existing benchmarks often fail to reflect the varied nature of user workloads. As a result, there is increasing momentum toward creating databases that incorporate real-world user data to more accurately mirror business environments. However, privacy concerns deter users from directly sharing their data, underscoring the importance of creating synthesized databases for benchmarking that also prioritize privacy protection. Differential privacy (DP)-based data synthesis has become a key method for safeguarding privacy when sharing data, but the focus has largely been on minimizing errors in aggregate queries or downstream ML tasks, with less attention given to benchmarking factors like query runtime performance. This paper delves into differentially private database synthesis specifically for benchmark publishing scenarios, aiming to produce a synthetic database whose benchmarking factors closely resemble those of the original data. Introducing PrivBench, an innovative synthesis framework based on sum-product networks (SPNs), we support the synthesis of high-quality benchmark databases that maintain fidelity in both data distribution and query runtime performance while preserving privacy. We validate that PrivBench can ensure database-level DP even when generating multi-relation databases with complex reference relationships. Our extensive experiments show that PrivBench efficiently synthesizes data that maintains privacy and excels in both data distribution similarity and query runtime similarity.
UR - http://www.scopus.com/inward/record.url?scp=86000019198&partnerID=8YFLogxK
U2 - 10.14778/3705829.3705855
DO - 10.14778/3705829.3705855
M3 - Conference article
AN - SCOPUS:86000019198
SN - 2150-8097
VL - 18
SP - 413
EP - 425
JO - Proceedings of the VLDB Endowment
JF - Proceedings of the VLDB Endowment
IS - 2
T2 - 51st International Conference on Very Large Data Bases, VLDB 2025
Y2 - 1 September 2025 through 5 September 2025
ER -