Domain Adaptation for Deep Entity Resolution

Jianhong Tu; Ju Fan; Nan Tang; Peng Wang; Chengliang Chai; Guoliang Li; Ruixue Fan; Xiaoyong Du

doi:10.1145/3514221.3517870

Jianhong Tu, Ju Fan^*, Nan Tang, Peng Wang, Chengliang Chai, Guoliang Li, Ruixue Fan, Xiaoyong Du

^*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

32 Citations (Scopus)

Abstract

Entity resolution (ER) is a core problem of data integration. The state-of-the-art (SOTA) results on ER are achieved by deep learning (DL) based methods, trained with a lot of labeled matching/non-matching entity pairs. This may not be a problem when using well-prepared benchmark datasets. Nevertheless, for many real-world ER applications, the situation changes dramatically, with a painful issue to collect large-scale labeled datasets. In this paper, we seek to answer: If we have a well-labeled source ER dataset, can we train a DL-based ER model for a target dataset, without any labels or with a few labels? This is known as domain adaptation (DA), which has achieved great successes in computer vision and natural language processing, but is not systematically studied for ER. Our goal is to systematically explore the benefits and limitations of a wide range of DA methods for ER. To this purpose, we develop a DADER (Domain Adaptation for Deep Entity Resolution) framework that significantly advances ER in applying DA. We define a space of design solutions for the three modules of DADER, namely Feature Extractor, Matcher, and Feature Aligner. We conduct so far the most comprehensive experimental study to explore the design space and compare different choices of DA for ER. We provide guidance for selecting appropriate design solutions based on extensive experiments.

Original language	English
Title of host publication	SIGMOD 2022 - Proceedings of the 2022 International Conference on Management of Data
Publisher	Association for Computing Machinery
Pages	443-457
Number of pages	15
ISBN (Electronic)	9781450392495
DOI	https://doi.org/10.1145/3514221.3517870
Publication status	Published - Jun 2022
Externally published	Yes
Event	2022 ACM SIGMOD International Conference on the Management of Data, SIGMOD 2022 - Hybrid, Philadelphia, United States. Duration: 12 Jun 2022 → 17 Jun 2022

Publication series

Name	Proceedings of the ACM SIGMOD International Conference on Management of Data
ISSN (Print)	0730-8078

Conference

Conference	2022 ACM SIGMOD International Conference on the Management of Data, SIGMOD 2022
Country/Territory	United States
City	Hybrid, Philadelphia
Period	12/06/22 → 17/06/22

Cite this

Tu, J., Fan, J., Tang, N., Wang, P., Chai, C., Li, G., Fan, R., & Du, X. (2022). Domain Adaptation for Deep Entity Resolution. In SIGMOD 2022 - Proceedings of the 2022 International Conference on Management of Data (pp. 443-457). (Proceedings of the ACM SIGMOD International Conference on Management of Data). Association for Computing Machinery. https://doi.org/10.1145/3514221.3517870

@inproceedings{5e561045d6a84320ad63da7e0da04dce,
  title = "Domain Adaptation for Deep Entity Resolution",
  abstract = "Entity resolution (ER) is a core problem of data integration. The state-of-the-art (SOTA) results on ER are achieved by deep learning (DL) based methods, trained with a lot of labeled matching/non-matching entity pairs. This may not be a problem when using well-prepared benchmark datasets. Nevertheless, for many real-world ER applications, the situation changes dramatically, with a painful issue to collect large-scale labeled datasets. In this paper, we seek to answer: If we have a well-labeled source ER dataset, can we train a DL-based ER model for a target dataset, without any labels or with a few labels? This is known as domain adaptation (DA), which has achieved great successes in computer vision and natural language processing, but is not systematically studied for ER. Our goal is to systematically explore the benefits and limitations of a wide range of DA methods for ER. To this purpose, we develop a DADER (Domain Adaptation for Deep Entity Resolution) framework that significantly advances ER in applying DA. We define a space of design solutions for the three modules of DADER, namely Feature Extractor, Matcher, and Feature Aligner. We conduct so far the most comprehensive experimental study to explore the design space and compare different choices of DA for ER. We provide guidance for selecting appropriate design solutions based on extensive experiments.",
  keywords = "data integration, deep learning, domain adaptation",
  author = "Jianhong Tu and Ju Fan and Nan Tang and Peng Wang and Chengliang Chai and Guoliang Li and Ruixue Fan and Xiaoyong Du",
  note = "Publisher Copyright: {\textcopyright} 2022 ACM.; 2022 ACM SIGMOD International Conference on the Management of Data, SIGMOD 2022 ; Conference date: 12-06-2022 Through 17-06-2022",
  year = "2022",
  month = jun,
  doi = "10.1145/3514221.3517870",
  language = "English",
  series = "Proceedings of the ACM SIGMOD International Conference on Management of Data",
  publisher = "Association for Computing Machinery",
  pages = "443--457",
  booktitle = "SIGMOD 2022 - Proceedings of the 2022 International Conference on Management of Data",
}

Tu, J, Fan, J, Tang, N, Wang, P, Chai, C, Li, G, Fan, R & Du, X 2022, Domain Adaptation for Deep Entity Resolution. in SIGMOD 2022 - Proceedings of the 2022 International Conference on Management of Data. Proceedings of the ACM SIGMOD International Conference on Management of Data, Association for Computing Machinery, pp. 443-457, 2022 ACM SIGMOD International Conference on the Management of Data, SIGMOD 2022, Hybrid, Philadelphia, United States, 12/06/22. https://doi.org/10.1145/3514221.3517870

Domain Adaptation for Deep Entity Resolution. / Tu, Jianhong; Fan, Ju; Tang, Nan et al.
SIGMOD 2022 - Proceedings of the 2022 International Conference on Management of Data. Association for Computing Machinery, 2022. p. 443-457 (Proceedings of the ACM SIGMOD International Conference on Management of Data).

TY  - GEN
T1  - Domain Adaptation for Deep Entity Resolution
AU  - Tu, Jianhong
AU  - Fan, Ju
AU  - Tang, Nan
AU  - Wang, Peng
AU  - Chai, Chengliang
AU  - Li, Guoliang
AU  - Fan, Ruixue
AU  - Du, Xiaoyong
PY  - 2022/6
Y1  - 2022/6
AB  - Entity resolution (ER) is a core problem of data integration. The state-of-the-art (SOTA) results on ER are achieved by deep learning (DL) based methods, trained with a lot of labeled matching/non-matching entity pairs. This may not be a problem when using well-prepared benchmark datasets. Nevertheless, for many real-world ER applications, the situation changes dramatically, with a painful issue to collect large-scale labeled datasets. In this paper, we seek to answer: If we have a well-labeled source ER dataset, can we train a DL-based ER model for a target dataset, without any labels or with a few labels? This is known as domain adaptation (DA), which has achieved great successes in computer vision and natural language processing, but is not systematically studied for ER. Our goal is to systematically explore the benefits and limitations of a wide range of DA methods for ER. To this purpose, we develop a DADER (Domain Adaptation for Deep Entity Resolution) framework that significantly advances ER in applying DA. We define a space of design solutions for the three modules of DADER, namely Feature Extractor, Matcher, and Feature Aligner. We conduct so far the most comprehensive experimental study to explore the design space and compare different choices of DA for ER. We provide guidance for selecting appropriate design solutions based on extensive experiments.
KW  - data integration
KW  - deep learning
KW  - domain adaptation
UR  - http://www.scopus.com/inward/record.url?scp=85132729979&partnerID=8YFLogxK
U2  - 10.1145/3514221.3517870
DO  - 10.1145/3514221.3517870
M3  - Conference contribution
AN  - SCOPUS:85132729979
T3  - Proceedings of the ACM SIGMOD International Conference on Management of Data
SP  - 443
EP  - 457
BT  - SIGMOD 2022 - Proceedings of the 2022 International Conference on Management of Data
PB  - Association for Computing Machinery
T2  - 2022 ACM SIGMOD International Conference on the Management of Data, SIGMOD 2022
Y2  - 12 June 2022 through 17 June 2022
ER  -
