TY - GEN
T1 - CrossEM
T2 - 41st IEEE International Conference on Data Engineering, ICDE 2025
AU - Yuan, Qin
AU - Yuan, Ye
AU - Wen, Zhenyu
AU - Chen, Chi
AU - Wang, Guoren
N1 - Publisher Copyright: © 2025 IEEE.
PY - 2025
Y1 - 2025
AB - Entity matching (EM) aims to identify equivalent entities across different data sources. Current EM assumes that these data are either homogeneous with aligned schema or heterogeneous but can be transformed into a unified modality. There is an urgent need to consider the entities with different modalities to support practical application scenarios over data lakes such as multi-modal data integration and recommendation system. It is impractical to unify their data modalities. To support EM on heterogeneous entity with different data formats and modalities, we propose cross-modal entity matching in this paper. Inspired by the promising performance achieved by recent pre-trained models, we perform cross-modal entity matching by prompt-tuning pre-trained multi-modal large models (MMLMs) in an unsupervised manner. However, the prompt-tuning faces three challenging issues: (i) objective gap between pre-training and tuning of MMLMs; (ii) data modality gap between the inputs of MMLMs and our matching task; (iii) prompt efficiency on large data. Therefore, we firstly propose a novel EM framework (namely, CrossEM) that addresses cross-modal EM as a matching probability problem with specific prompt-tuning. Secondly, two alternative prompt generation methods are designed to extract structural knowledge from heterogeneous data to overcome the data modality gap with pre-trained models. Thirdly, we present an improved matching framework (namely, CrossEM+) to boost the prompt efficiency on large heterogeneous data. Experimental evaluations verify that our methods significantly outperform the state-of-the-art approaches on three benchmarks. Furthermore, our case study highlights the considerable potential of cross-modal EM in improving the performance of downstream tasks, thereby benefitting a wider range of research areas.
KW - cross-modal entity matching
KW - data lake
KW - prompt tuning
UR - https://www.scopus.com/pages/publications/105015407931
U2 - 10.1109/ICDE65448.2025.00053
DO - 10.1109/ICDE65448.2025.00053
M3 - Conference contribution
AN - SCOPUS:105015407931
T3 - Proceedings - International Conference on Data Engineering
SP - 627
EP - 640
BT - Proceedings - 2025 IEEE 41st International Conference on Data Engineering, ICDE 2025
PB - IEEE Computer Society
Y2 - 19 May 2025 through 23 May 2025
ER -