TY - GEN
T1 - Cost-Effective In-Context Learning for Entity Resolution
T2 - 40th IEEE International Conference on Data Engineering, ICDE 2024
AU - Fan, Meihao
AU - Han, Xiaoyue
AU - Fan, Ju
AU - Chai, Chengliang
AU - Tang, Nan
AU - Li, Guoliang
AU - Du, Xiaoyong
N1 - Publisher Copyright:
© 2024 IEEE.
PY - 2024
Y1 - 2024
AB - Entity resolution (ER) is an important data integration task with a wide spectrum of applications. State-of-the-art solutions for ER rely on pre-trained language models (PLMs), which require fine-tuning on large numbers of labeled matching/non-matching entity pairs. Recently, large language models (LLMs), such as GPT-4, have shown the ability to perform many tasks without tuning model parameters, a capability known as in-context learning (ICL) that facilitates effective learning from a few labeled demonstrations provided in the input context. However, existing ICL approaches to ER typically require providing a task description and a set of demonstrations for each entity pair, and are thus limited by the monetary cost of interfacing with LLMs. To address this problem, in this paper we present a comprehensive study of how to develop a cost-effective batch prompting approach to ER. We introduce a framework, BATCHER, consisting of demonstration selection and question batching, and explore different design choices that support batch prompting for ER. We also devise a covering-based demonstration selection strategy that achieves an effective balance between matching accuracy and monetary cost. We conduct a thorough evaluation to explore the design space and evaluate our proposed strategies. Through extensive experiments, we find that batch prompting is highly cost-effective for ER compared not only with PLM-based methods fine-tuned on extensive labeled data but also with LLM-based methods using manually designed prompts. We also provide guidance for selecting appropriate design choices for batch prompting.
KW - Batch Prompting
KW - Entity Resolution
KW - Large Language Model
UR - http://www.scopus.com/inward/record.url?scp=85200487964&partnerID=8YFLogxK
U2 - 10.1109/ICDE60146.2024.00284
DO - 10.1109/ICDE60146.2024.00284
M3 - Conference contribution
AN - SCOPUS:85200487964
T3 - Proceedings - International Conference on Data Engineering
SP - 3696
EP - 3709
BT - Proceedings - 2024 IEEE 40th International Conference on Data Engineering, ICDE 2024
PB - IEEE Computer Society
Y2 - 13 May 2024 through 17 May 2024
ER -
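
A minimal sketch of the batch prompting idea the abstract describes: one prompt carries a single task description, a shared set of labeled demonstrations, and a batch of entity-pair questions, so the fixed prompt overhead is amortized over many questions instead of being paid once per pair. The prompt layout, the serialize/build_batch_prompt helpers, and the example records below are illustrative assumptions, not the paper's exact BATCHER design or its covering-based selection strategy.

```python
# Illustrative sketch of batch prompting for entity resolution (ER).
# Assumption: records are attribute dicts; the LLM call itself is omitted.

from typing import Dict, List, Tuple

Record = Dict[str, str]
Pair = Tuple[Record, Record]

def serialize(record: Record) -> str:
    """Flatten an entity record into 'attr: value' text, a common ER encoding."""
    return "; ".join(f"{k}: {v}" for k, v in record.items())

def build_batch_prompt(demos: List[Tuple[Pair, bool]], questions: List[Pair]) -> str:
    """Pack one task description, shared demonstrations, and a batch of
    entity-pair questions into a single prompt string."""
    lines = ["Decide for each question whether the two records refer to the "
             "same real-world entity. Answer 'yes' or 'no' per question.", ""]
    for (a, b), label in demos:  # demonstrations are shared by all questions
        lines += [f"Record A: {serialize(a)}",
                  f"Record B: {serialize(b)}",
                  f"Answer: {'yes' if label else 'no'}", ""]
    for i, (a, b) in enumerate(questions, 1):  # batched questions
        lines += [f"Q{i}:",
                  f"Record A: {serialize(a)}",
                  f"Record B: {serialize(b)}"]
    lines += ["", "Answers:"]
    return "\n".join(lines)

if __name__ == "__main__":
    demos = [
        (({"title": "iPhone 13 128GB"}, {"title": "Apple iPhone 13, 128 GB"}), True),
        (({"title": "iPhone 13 128GB"}, {"title": "Samsung Galaxy S21"}), False),
    ]
    questions = [
        ({"title": "ThinkPad X1 Carbon Gen 9"},
         {"title": "Lenovo ThinkPad X1 Carbon (9th Gen)"}),
        ({"title": "ThinkPad X1 Carbon Gen 9"}, {"title": "Dell XPS 13"}),
    ]
    # One LLM call now covers both questions instead of two separate calls,
    # paying for the task description and demonstrations only once.
    print(build_batch_prompt(demos, questions))
```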