Synthesizing Privacy Preserving Entity Resolution Datasets

Xuedi Qin, Chengliang Chai*, Nan Tang, Jian Li, Yuyu Luo, Guoliang Li*, Yaoyu Zhu

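*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-reviewed

4 Citations (Scopus)

Abstract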

Entity resolution (ER) is a core problem in data integration. Many companies hold numerous datasets on which ER must be performed to integrate the data. On the one hand, it is nontrivial for non-ER experts within companies to design ER solutions. On the other hand, most companies are reluctant to release their real datasets for multiple reasons (e.g., privacy issues). A typical solution from the machine learning (ML) and statistics communities is to create surrogate (a.k.a. analogous) datasets based on the real dataset and release them to the public to train ML models, so that the models trained on the surrogate datasets can be either used directly or adapted for the real dataset by the companies. In this paper, we study a new problem of synthesizing surrogate ER datasets using transformer models, with the goal that the ER model trained on the synthesized dataset can be used directly on the real dataset. We propose privacy-preserving methods to synthesize ER datasets: we first learn the true similarity distributions of both matching and non-matching entity pairs from the real dataset. We then devise algorithms that satisfy differential privacy and can synthesize fake but semantically meaningful entities, add matching and non-matching labels to these fake entity pairs, and ensure that the fake and real datasets have similar distributions. We also describe a method for entity rejection to avoid synthesizing bad fake entities that may destroy the original distributions. Extensive experiments show that ER matchers trained on real and synthetic ER datasets have very close performance on the same test sets: their F1 scores are within 6% of each other on 3 commonly used ER datasets, and their average precision and recall differences are less than 5%.
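The abstract names three steps (learning similarity distributions, releasing them under differential privacy, and rejecting synthetic entities that distort those distributions) without giving the mechanisms. The sketch below is only an illustration under stated assumptions, not the authors' algorithm: it assumes token-level Jaccard similarity, a Laplace mechanism applied to histogram counts, and a simple per-bin quota as the rejection rule. The function names `jaccard`, `dp_histogram`, and `within_quota` are hypothetical.

```python
import math
import random


def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity (an assumed similarity function)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Laplace(0, scale) noise, drawn as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_histogram(scores, bins: int, epsilon: float, rng: random.Random):
    """Normalized histogram of scores in [0, 1] with Laplace-noised counts.

    Assumption: adding or removing one entity pair changes one count by 1,
    so the sensitivity is 1 and the noise scale is 1 / epsilon.
    """
    counts = [0.0] * bins
    for s in scores:
        counts[min(int(s * bins), bins - 1)] += 1
    # Clamp negative noisy counts to zero before normalizing.
    noisy = [max(c + laplace_noise(1.0 / epsilon, rng), 0.0) for c in counts]
    total = sum(noisy)
    if total == 0:
        return [1.0 / bins] * bins  # fall back to uniform if all mass is clamped
    return [c / total for c in noisy]


def within_quota(sim: float, target, synth_counts, n_target: int) -> bool:
    """Rejection rule (assumed): accept a candidate pair only while its
    similarity bin holds fewer pairs than the target distribution allows."""
    b = min(int(sim * len(target)), len(target) - 1)
    return synth_counts[b] + 1 <= math.ceil(target[b] * n_target)
```

In this sketch, one histogram would be built for matching pairs and another for non-matching pairs, and a generator would keep proposing candidate entities until `within_quota` accepts them.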

Original language: English
Title of host publication: Proceedings - 2022 IEEE 38th International Conference on Data Engineering, ICDE 2022
Publisher: IEEE Computer Society
Pages: 2359-2371
Number of pages: 13
ISBN (Electronic): 9781665408837
DOI
Publication status: Published - 2022
Externally published: Yes
Event: 38th IEEE International Conference on Data Engineering, ICDE 2022 - Virtual, Online, Malaysia
Duration: 9 May 2022 to 12 May 2022

Publication series

Name: Proceedings - International Conference on Data Engineering
Volume: 2022-May
ISSN (Print): 1084-4627

Conference

Conference: 38th IEEE International Conference on Data Engineering, ICDE 2022
Country/Territory: Malaysia
City: Virtual, Online
Period: 9/05/22 to 12/05/22
