Synthesizing Privacy Preserving Entity Resolution Datasets

Xuedi Qin; Chengliang Chai; Nan Tang; Jian Li; Yuyu Luo; Guoliang Li; Yaoyu Zhu

doi:10.1109/ICDE53745.2022.00222

Xuedi Qin, Chengliang Chai*, Nan Tang, Jian Li, Yuyu Luo, Guoliang Li*, Yaoyu Zhu

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

6 Citations (Scopus)

Abstract

Entity resolution (ER) is a core problem in data integration. Many companies hold numerous datasets on which ER must be performed to integrate the data. On the one hand, it is nontrivial for non-ER experts within companies to design ER solutions. On the other hand, most companies are reluctant to release their real datasets for multiple reasons (e.g., privacy issues). A typical solution from the machine learning (ML) and statistics communities is to create surrogate (a.k.a. analogous) datasets based on the real dataset and release them to the public for training ML models, so that the models trained on the surrogate datasets can be either used directly or adapted for the real dataset by the companies. In this paper, we study a new problem of synthesizing surrogate ER datasets using transformer models, with the goal that the ER model trained on the synthesized dataset can be used directly on the real dataset. We propose privacy-preserving methods to synthesize ER datasets: we first learn the true similarity distributions of both matching and non-matching entity pairs from the real dataset. We then devise algorithms that satisfy differential privacy and can synthesize fake but semantically meaningful entities, add matching and non-matching labels to these fake entity pairs, and ensure that the fake and real datasets have similar distributions. We also describe a method for entity rejection to avoid synthesizing bad fake entities that may destroy the original distributions. Extensive experiments show that ER matchers trained on real and synthetic ER datasets have very close performance on the same test sets: their F1 scores are within 6% of each other on 3 commonly used ER datasets, and their average precision and recall differences are less than 5%.
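As a rough illustration of the first step the abstract mentions (learning similarity distributions of matching and non-matching pairs under differential privacy), the following is a minimal sketch and not the authors' algorithm. It assumes token-level Jaccard similarity, an equal-width histogram, the standard Laplace mechanism, and that each labeled pair is the privacy unit; the function names `jaccard`, `laplace_noise`, and `dp_similarity_histogram` are hypothetical.

```python
import random


def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity between two strings (assumed metric)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_similarity_histogram(scores, bins, epsilon, rng):
    """Return a noisy, normalized histogram of similarity scores in [0, 1].

    Adding or removing one pair changes exactly one bin count by 1, so the
    L1 sensitivity is 1 and Laplace noise with scale 1/epsilon is added.
    Clamping and normalizing afterwards are post-processing steps.
    """
    counts = [0] * bins
    for s in scores:
        idx = min(int(s * bins), bins - 1)  # a score of 1.0 goes in the last bin
        counts[idx] += 1
    noisy = [max(0.0, c + laplace_noise(1.0 / epsilon, rng)) for c in counts]
    total = sum(noisy)
    if total == 0:
        return [1.0 / bins] * bins  # fall back to uniform if all mass was clamped
    return [n / total for n in noisy]


# Hypothetical matching pairs, used only to exercise the functions.
matching_pairs = [
    ("apple iphone 12 64gb", "apple iphone 12 64 gb"),
    ("sony wh-1000xm4 headphones", "sony wh-1000xm4 wireless headphones"),
    ("dell xps 13 laptop", "dell xps 13"),
]
scores = [jaccard(a, b) for a, b in matching_pairs]
print(dp_similarity_histogram(scores, bins=5, epsilon=1.0, rng=random.Random(0)))
```

A separate histogram would be built the same way for non-matching pairs. Protecting whole entities rather than individual pairs would need a larger sensitivity bound, since one entity can appear in many pairs.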

Original language	English
Title of host publication	Proceedings - 2022 IEEE 38th International Conference on Data Engineering, ICDE 2022
Publisher	IEEE Computer Society
Pages	2359-2371
Number of pages	13
ISBN (Electronic)	9781665408837
DOIs	https://doi.org/10.1109/ICDE53745.2022.00222
Publication status	Published - 2022
Externally published	Yes
Event	38th IEEE International Conference on Data Engineering, ICDE 2022 - Virtual, Online, Malaysia
Duration	9 May 2022 → 12 May 2022

Publication series

Name	Proceedings - International Conference on Data Engineering
Volume	2022-May
ISSN (Print)	1084-4627

Conference

Conference	38th IEEE International Conference on Data Engineering, ICDE 2022
Country/Territory	Malaysia
City	Virtual, Online
Period	9/05/22 → 12/05/22

Keywords

Data Synthesis
Entity Resolution

Access to Document

10.1109/ICDE53745.2022.00222

Cite this

Qin, X., Chai, C., Tang, N., Li, J., Luo, Y., Li, G., & Zhu, Y. (2022). Synthesizing Privacy Preserving Entity Resolution Datasets. In Proceedings - 2022 IEEE 38th International Conference on Data Engineering, ICDE 2022 (pp. 2359-2371). (Proceedings - International Conference on Data Engineering; Vol. 2022-May). IEEE Computer Society. https://doi.org/10.1109/ICDE53745.2022.00222

@inproceedings{0497cd37dd054d1c923457af61a9c61a,
title = "Synthesizing Privacy Preserving Entity Resolution Datasets",
abstract = "Entity resolution (ER) is a core problem in data integration. Many companies hold numerous datasets on which ER must be performed to integrate the data. On the one hand, it is nontrivial for non-ER experts within companies to design ER solutions. On the other hand, most companies are reluctant to release their real datasets for multiple reasons (e.g., privacy issues). A typical solution from the machine learning (ML) and statistics communities is to create surrogate (a.k.a. analogous) datasets based on the real dataset and release them to the public for training ML models, so that the models trained on the surrogate datasets can be either used directly or adapted for the real dataset by the companies. In this paper, we study a new problem of synthesizing surrogate ER datasets using transformer models, with the goal that the ER model trained on the synthesized dataset can be used directly on the real dataset. We propose privacy-preserving methods to synthesize ER datasets: we first learn the true similarity distributions of both matching and non-matching entity pairs from the real dataset. We then devise algorithms that satisfy differential privacy and can synthesize fake but semantically meaningful entities, add matching and non-matching labels to these fake entity pairs, and ensure that the fake and real datasets have similar distributions. We also describe a method for entity rejection to avoid synthesizing bad fake entities that may destroy the original distributions. Extensive experiments show that ER matchers trained on real and synthetic ER datasets have very close performance on the same test sets: their F1 scores are within 6\% of each other on 3 commonly used ER datasets, and their average precision and recall differences are less than 5\%.",
keywords = "Data Synthesis, Entity Resolution",
author = "Xuedi Qin and Chengliang Chai and Nan Tang and Jian Li and Yuyu Luo and Guoliang Li and Yaoyu Zhu",
note = "Publisher Copyright: {\textcopyright} 2022 IEEE.; 38th IEEE International Conference on Data Engineering, ICDE 2022 ; Conference date: 09-05-2022 Through 12-05-2022",
year = "2022",
doi = "10.1109/ICDE53745.2022.00222",
language = "English",
series = "Proceedings - International Conference on Data Engineering",
publisher = "IEEE Computer Society",
pages = "2359--2371",
booktitle = "Proceedings - 2022 IEEE 38th International Conference on Data Engineering, ICDE 2022",
address = "United States",
}

Qin, X, Chai, C, Tang, N, Li, J, Luo, Y, Li, G & Zhu, Y 2022, Synthesizing Privacy Preserving Entity Resolution Datasets. in Proceedings - 2022 IEEE 38th International Conference on Data Engineering, ICDE 2022. Proceedings - International Conference on Data Engineering, vol. 2022-May, IEEE Computer Society, pp. 2359-2371, 38th IEEE International Conference on Data Engineering, ICDE 2022, Virtual, Online, Malaysia, 9/05/22. https://doi.org/10.1109/ICDE53745.2022.00222

Synthesizing Privacy Preserving Entity Resolution Datasets. / Qin, Xuedi; Chai, Chengliang; Tang, Nan et al.
Proceedings - 2022 IEEE 38th International Conference on Data Engineering, ICDE 2022. IEEE Computer Society, 2022. p. 2359-2371 (Proceedings - International Conference on Data Engineering; Vol. 2022-May).

TY  - GEN
T1  - Synthesizing Privacy Preserving Entity Resolution Datasets
AU  - Qin, Xuedi
AU  - Chai, Chengliang
AU  - Tang, Nan
AU  - Li, Jian
AU  - Luo, Yuyu
AU  - Li, Guoliang
AU  - Zhu, Yaoyu
PY  - 2022
Y1  - 2022
N2  - Entity resolution (ER) is a core problem in data integration. Many companies hold numerous datasets on which ER must be performed to integrate the data. On the one hand, it is nontrivial for non-ER experts within companies to design ER solutions. On the other hand, most companies are reluctant to release their real datasets for multiple reasons (e.g., privacy issues). A typical solution from the machine learning (ML) and statistics communities is to create surrogate (a.k.a. analogous) datasets based on the real dataset and release them to the public for training ML models, so that the models trained on the surrogate datasets can be either used directly or adapted for the real dataset by the companies. In this paper, we study a new problem of synthesizing surrogate ER datasets using transformer models, with the goal that the ER model trained on the synthesized dataset can be used directly on the real dataset. We propose privacy-preserving methods to synthesize ER datasets: we first learn the true similarity distributions of both matching and non-matching entity pairs from the real dataset. We then devise algorithms that satisfy differential privacy and can synthesize fake but semantically meaningful entities, add matching and non-matching labels to these fake entity pairs, and ensure that the fake and real datasets have similar distributions. We also describe a method for entity rejection to avoid synthesizing bad fake entities that may destroy the original distributions. Extensive experiments show that ER matchers trained on real and synthetic ER datasets have very close performance on the same test sets: their F1 scores are within 6% of each other on 3 commonly used ER datasets, and their average precision and recall differences are less than 5%.
KW  - Data Synthesis
KW  - Entity Resolution
UR  - http://www.scopus.com/inward/record.url?scp=85136428800&partnerID=8YFLogxK
U2  - 10.1109/ICDE53745.2022.00222
DO  - 10.1109/ICDE53745.2022.00222
M3  - Conference contribution
AN  - SCOPUS:85136428800
T3  - Proceedings - International Conference on Data Engineering
SP  - 2359
EP  - 2371
BT  - Proceedings - 2022 IEEE 38th International Conference on Data Engineering, ICDE 2022
PB  - IEEE Computer Society
T2  - 38th IEEE International Conference on Data Engineering, ICDE 2022
Y2  - 9 May 2022 through 12 May 2022
ER  -
