Synthesizing Privacy Preserving Entity Resolution Datasets

Xuedi Qinl, Chengliang Chai*, Nan Tang, Jian Li, Yuyu Luo, Guoliang Li*, Yaoyu Zhu

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceedingConference contributionpeer-review

4 Citations (Scopus)

Abstract

Entity resolution (ER) is a core problem in data integration. Many companies have lots of datasets where ER needs to be conducted to integrate the data. On the one hand, it is nontrivial for non-ER experts within companies to design ER solutions. On the other hand, most companies are reluctant to release their real datasets for multiple reasons (e.g., privacy issues). A typical solution from the machine learning (ML) and the statistical community is to create surrogate (a.k.a. analogous) datasets based on the real dataset, release these surrogate datasets to the public to train ML models, such that these models trained on surrogate datasets can be either directly used or be adapted for the real dataset by the companies. In this paper, we study a new problem of synthesizing surrogate ER datasets using transformer models, with the goal that the ER model trained on the synthesized dataset can be used directly on the real dataset. We propose privacy preserving methods to synthesize ER datasets: we first learn the true similarity distributions of both matching and non-matching entity pairs from real dataset. We then devise algorithms that satisfy differential privacy and can synthesize fake but semantically meaningful entities, add matching and non-matching labels to these fake entity pairs, and ensure that the fake and real datasets have similar distributions. We also describe a method for entity rejection to avoid synthesizing bad fake entities that may destroy the original distributions. Extensive experiments show that ER matchers trained on real and synthetic ER datasets have very close performance on the same test sets - their F1 scores differ within 6% on 3 commonly used ER datasets, and their average precision, recall differences are less than 5%.

Original languageEnglish
Title of host publicationProceedings - 2022 IEEE 38th International Conference on Data Engineering, ICDE 2022
PublisherIEEE Computer Society
Pages2359-2371
Number of pages13
ISBN (Electronic)9781665408837
DOIs
Publication statusPublished - 2022
Externally publishedYes
Event38th IEEE International Conference on Data Engineering, ICDE 2022 - Virtual, Online, Malaysia
Duration: 9 May 202212 May 2022

Publication series

NameProceedings - International Conference on Data Engineering
Volume2022-May
ISSN (Print)1084-4627

Conference

Conference38th IEEE International Conference on Data Engineering, ICDE 2022
Country/TerritoryMalaysia
CityVirtual, Online
Period9/05/2212/05/22

Keywords

  • Data Synthesis
  • Entity Resolution

Fingerprint

Dive into the research topics of 'Synthesizing Privacy Preserving Entity Resolution Datasets'. Together they form a unique fingerprint.

Cite this