LEAD: Iterative Data Selection for Efficient LLM Instruction Tuning

  • Xiaotian Lin
  • Yanlin Qi
  • Yizhang Zhu
  • Themis Palpanas
  • Chengliang Chai
  • Nan Tang
  • Yuyu Luo*
  • *Corresponding author for this work
  • The Hong Kong University of Science and Technology (Guangzhou)
  • Université Paris Cité
  • BIT

Research output: Contribution to journal, Conference article, peer-review

Abstract

Instruction tuning has emerged as a critical paradigm for improving the capabilities and alignment of large language models (LLMs). However, existing iterative model-aware data selection methods incur significant computational overhead, as they rely on repeatedly performing full-dataset model inference to estimate sample utility for subsequent training iterations. In this paper, we propose LEAD, a framework that LEArns to select Data iteratively by accurately estimating sample utility entirely within the standard training loop, eliminating the need for additional model inference. At its core, LEAD introduces Instance-Level Dynamic Uncertainty (IDU), a theoretically grounded utility function combining instantaneous training loss, gradient-based approximation of loss changes, and exponential smoothing of historical loss signals. To further scale efficiently to large datasets, LEAD employs a two-stage, coarse-to-fine selection strategy, adaptively prioritizing informative clusters through a multi-armed bandit mechanism, followed by precise fine-grained selection of high-utility samples using IDU. Extensive experiments across four diverse benchmarks show that LEAD significantly outperforms state-of-the-art methods, improving average model performance by 6.1%-10.8% while using only 2.5% of the training data and reducing overall training time by 5-10×.
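
The abstract names three ingredients for the utility score (current training loss, an estimate of how the loss will change, and exponential smoothing over past values) plus a bandit that picks clusters before samples are ranked. The Python sketch below only illustrates how such pieces could fit together. It is not the paper's IDU formula or its bandit algorithm: the additive combination, the smoothing weight `beta`, the use of UCB1, and every function and parameter name are assumptions made for illustration.

```python
import math


def smoothed_utility(prev_utility, loss, predicted_loss_change, beta=0.9):
    """Blend the current loss signal with its history (hypothetical form).

    `loss` is the loss observed in the training step, and
    `predicted_loss_change` stands in for a gradient-based estimate of how
    that loss will move. Adding them and smoothing with `beta` is an
    assumption, not the definition given in the paper.
    """
    current_signal = loss + predicted_loss_change
    return beta * prev_utility + (1.0 - beta) * current_signal


def pick_cluster_ucb1(pull_counts, mean_rewards, exploration=1.0):
    """Coarse stage: choose a cluster index with UCB1 (an assumed bandit rule)."""
    # Try every cluster once before comparing confidence bounds.
    for index, count in enumerate(pull_counts):
        if count == 0:
            return index
    total_pulls = sum(pull_counts)
    scores = [
        mean_rewards[i]
        + exploration * math.sqrt(2.0 * math.log(total_pulls) / pull_counts[i])
        for i in range(len(pull_counts))
    ]
    return max(range(len(scores)), key=scores.__getitem__)


def top_k_samples(utilities, k):
    """Fine stage: keep the k sample ids with the highest utility."""
    return sorted(utilities, key=utilities.get, reverse=True)[:k]
```

In a training loop, `smoothed_utility` would be updated from losses already computed for the current batch, which is what lets a scheme like this avoid a separate inference pass over the full dataset.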

Original language: English
Pages (from-to): 426-439
Number of pages: 14
Journal: Proceedings of the VLDB Endowment
Volume: 19
Issue number: 3
Publication status: Published - 2025
Externally published: Yes
Event: 52nd International Conference on Very Large Data Bases, VLDB 2026 - Boston, United States
Duration: 31 Aug 2026 to 4 Sept 2026
