TY - GEN
T1 - Synthesizing Privacy Preserving Entity Resolution Datasets
AU - Qin, Xuedi
AU - Chai, Chengliang
AU - Tang, Nan
AU - Li, Jian
AU - Luo, Yuyu
AU - Li, Guoliang
AU - Zhu, Yaoyu
N1 - Publisher Copyright: © 2022 IEEE.
PY - 2022
Y1 - 2022
N2 - Entity resolution (ER) is a core problem in data integration. Many companies hold numerous datasets on which ER must be performed to integrate the data. On the one hand, it is nontrivial for non-ER experts within companies to design ER solutions. On the other hand, most companies are reluctant to release their real datasets for multiple reasons (e.g., privacy issues). A typical solution from the machine learning (ML) and statistics communities is to create surrogate (a.k.a. analogous) datasets based on the real dataset and release these surrogate datasets to the public to train ML models, so that the models trained on surrogate datasets can be either used directly or adapted for the real dataset by the companies. In this paper, we study a new problem of synthesizing surrogate ER datasets using transformer models, with the goal that the ER model trained on the synthesized dataset can be used directly on the real dataset. We propose privacy-preserving methods to synthesize ER datasets: we first learn the true similarity distributions of both matching and non-matching entity pairs from the real dataset. We then devise algorithms that satisfy differential privacy and can synthesize fake but semantically meaningful entities, add matching and non-matching labels to these fake entity pairs, and ensure that the fake and real datasets have similar distributions. We also describe a method for entity rejection to avoid synthesizing bad fake entities that may destroy the original distributions. Extensive experiments show that ER matchers trained on real and synthetic ER datasets have very close performance on the same test sets: their F1 scores differ by no more than 6% on 3 commonly used ER datasets, and their average precision and recall differences are less than 5%.
KW - Data Synthesis
KW - Entity Resolution
UR - http://www.scopus.com/inward/record.url?scp=85136428800&partnerID=8YFLogxK
U2 - 10.1109/ICDE53745.2022.00222
DO - 10.1109/ICDE53745.2022.00222
M3 - Conference contribution
AN - SCOPUS:85136428800
T3 - Proceedings - International Conference on Data Engineering
SP - 2359
EP - 2371
BT - Proceedings - 2022 IEEE 38th International Conference on Data Engineering, ICDE 2022
PB - IEEE Computer Society
T2 - 38th IEEE International Conference on Data Engineering, ICDE 2022
Y2 - 9 May 2022 through 12 May 2022
ER -