TY - JOUR
T1 - DADER
T2 - 48th International Conference on Very Large Data Bases, VLDB 2022
AU - Tu, Jianhong
AU - Han, Xiaoyue
AU - Fan, Ju
AU - Tang, Nan
AU - Chai, Chengliang
AU - Li, Guoliang
AU - Du, Xiaoyong
N1 - Publisher Copyright: © 2022, VLDB Endowment. All rights reserved.
PY - 2022
Y1 - 2022
N2 - Entity resolution (ER) is a core data integration problem: identifying pairs of data instances that refer to the same real-world entity. State-of-the-art ER results are achieved by deep learning (DL) based approaches. However, DL-based approaches typically require a large amount of labeled training data (i.e., matching and non-matching pairs), which incurs substantial manual labeling effort. In this paper, we introduce DADER, a hands-off deep ER system based on domain adaptation. DADER uses multiple well-labeled source ER datasets to train a DL-based ER model for a new target ER dataset that has no labels or only a few labels. To address the key challenge of domain shift, DADER judiciously selects labeled entity pairs from the source and then aligns the distributions of the source and the target using six popular domain adaptation strategies. DADER can also ask users to provide a few labels for further improvement. We have built DADER as an open-source Python library with intuitive APIs and demonstrated its utility in supporting hands-off ER in real-world scenarios.
UR - http://www.scopus.com/inward/record.url?scp=85137997395&partnerID=8YFLogxK
U2 - 10.14778/3554821.3554870
DO - 10.14778/3554821.3554870
M3 - Conference article
AN - SCOPUS:85137997395
SN - 2150-8097
VL - 15
SP - 3666
EP - 3669
JO - Proceedings of the VLDB Endowment
JF - Proceedings of the VLDB Endowment
IS - 12
Y2 - 5 September 2022 through 9 September 2022
ER -