TY - GEN
T1 - Domain Adaptation for Deep Entity Resolution
AU - Tu, Jianhong
AU - Fan, Ju
AU - Tang, Nan
AU - Wang, Peng
AU - Chai, Chengliang
AU - Li, Guoliang
AU - Fan, Ruixue
AU - Du, Xiaoyong
N1 - Publisher Copyright:
© 2022 ACM.
PY - 2022/6/10
Y1 - 2022/6/10
N2 - Entity resolution (ER) is a core problem of data integration. The state-of-the-art (SOTA) results on ER are achieved by deep learning (DL) based methods, trained with a lot of labeled matching/non-matching entity pairs. This may not be a problem when using well-prepared benchmark datasets. Nevertheless, for many real-world ER applications, the situation changes dramatically, with a painful issue to collect large-scale labeled datasets. In this paper, we seek to answer: If we have a well-labeled source ER dataset, can we train a DL-based ER model for a target dataset, without any labels or with a few labels? This is known as domain adaptation (DA), which has achieved great successes in computer vision and natural language processing, but is not systematically studied for ER. Our goal is to systematically explore the benefits and limitations of a wide range of DA methods for ER. To this purpose, we develop a DADER (Domain Adaptation for Deep Entity Resolution) framework that significantly advances ER in applying DA. We define a space of design solutions for the three modules of DADER, namely Feature Extractor, Matcher, and Feature Aligner. We conduct so far the most comprehensive experimental study to explore the design space and compare different choices of DA for ER. We provide guidance for selecting appropriate design solutions based on extensive experiments.
AB - Entity resolution (ER) is a core problem of data integration. The state-of-the-art (SOTA) results on ER are achieved by deep learning (DL) based methods, trained with a lot of labeled matching/non-matching entity pairs. This may not be a problem when using well-prepared benchmark datasets. Nevertheless, for many real-world ER applications, the situation changes dramatically, with a painful issue to collect large-scale labeled datasets. In this paper, we seek to answer: If we have a well-labeled source ER dataset, can we train a DL-based ER model for a target dataset, without any labels or with a few labels? This is known as domain adaptation (DA), which has achieved great successes in computer vision and natural language processing, but is not systematically studied for ER. Our goal is to systematically explore the benefits and limitations of a wide range of DA methods for ER. To this purpose, we develop a DADER (Domain Adaptation for Deep Entity Resolution) framework that significantly advances ER in applying DA. We define a space of design solutions for the three modules of DADER, namely Feature Extractor, Matcher, and Feature Aligner. We conduct so far the most comprehensive experimental study to explore the design space and compare different choices of DA for ER. We provide guidance for selecting appropriate design solutions based on extensive experiments.
KW - data integration
KW - deep learning
KW - domain adaptation
UR - http://www.scopus.com/inward/record.url?scp=85132729979&partnerID=8YFLogxK
U2 - 10.1145/3514221.3517870
DO - 10.1145/3514221.3517870
M3 - Conference contribution
AN - SCOPUS:85132729979
T3 - Proceedings of the ACM SIGMOD International Conference on Management of Data
SP - 443
EP - 457
BT - SIGMOD 2022 - Proceedings of the 2022 International Conference on Management of Data
PB - Association for Computing Machinery
T2 - 2022 ACM SIGMOD International Conference on the Management of Data, SIGMOD 2022
Y2 - 12 June 2022 through 17 June 2022
ER -