
SCoder: Iterative Self-Distillation for Bootstrapping Small-Scale Data Synthesizers to Empower Code LLMs

  • Xinyu Zhang
  • Changzhi Zhou
  • Linmei Hu*
  • Luhao Zhang
  • Xiancai Chen
  • Haomin Fu
  • Yang Yang
  • Mengdi Zhang
  • *Corresponding author of this work
  • Beijing Institute of Technology
  • Peking University
  • Meituan

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Peer-reviewed

Abstract

Existing code large language models (LLMs) often rely on large-scale instruction data distilled from proprietary LLMs for fine-tuning, which typically incurs high costs. In this paper, we explore the potential of small-scale open-source LLMs (e.g., 7B) as synthesizers for high-quality code instruction data construction. We first observe that the data synthesis capability of small-scale LLMs can be enhanced by training on a few superior data synthesis samples from proprietary LLMs. Building on this, we propose a novel iterative self-distillation approach to bootstrap small-scale LLMs, transforming them into powerful synthesizers that reduce reliance on proprietary LLMs and minimize costs. Concretely, in each iteration, to obtain diverse and high-quality self-distilled data, we design multi-checkpoint sampling and multi-aspect scoring strategies for initial data selection. Furthermore, to identify the most influential samples, we introduce a gradient-based influence estimation method for final data filtering. Based on the code instruction datasets from the small-scale synthesizers, we develop SCoder, a family of code generation models fine-tuned from DeepSeek-Coder. SCoder models achieve state-of-the-art code generation capabilities, demonstrating the effectiveness of our method.
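The abstract outlines a per-iteration pipeline: sample candidates from multiple synthesizer checkpoints, score them on several quality aspects for initial selection, then filter by gradient-based influence. The toy sketch below illustrates that control flow under stated assumptions; every function name, scoring aspect, and the stand-in "gradient" are hypothetical illustrations, not the paper's released implementation.

```python
# Toy sketch of one iteration of the self-distillation pipeline:
# multi-checkpoint sampling -> multi-aspect scoring -> influence filtering.
# All names and scoring choices here are illustrative assumptions.

def multi_checkpoint_sample(checkpoints, prompts):
    """Generate candidates from several synthesizer checkpoints
    to increase the diversity of the candidate pool."""
    return [ckpt(p) for ckpt in checkpoints for p in prompts]

def multi_aspect_score(sample, aspects):
    """Average several quality-aspect scorers (e.g. correctness,
    clarity); the concrete aspects are assumptions."""
    return sum(fn(sample) for fn in aspects) / len(aspects)

def influence_score(sample, val_grad, grad_fn):
    """Dot product between a sample's (toy) gradient and a validation
    gradient: a common proxy for training-data influence."""
    return sum(g * v for g, v in zip(grad_fn(sample), val_grad))

def self_distill_iteration(checkpoints, prompts, aspects,
                           val_grad, grad_fn, k_initial, k_final):
    cands = multi_checkpoint_sample(checkpoints, prompts)
    # Initial selection: keep the top-k by averaged multi-aspect score.
    cands.sort(key=lambda s: multi_aspect_score(s, aspects), reverse=True)
    selected = cands[:k_initial]
    # Final filtering: keep the most influential samples for fine-tuning.
    selected.sort(key=lambda s: influence_score(s, val_grad, grad_fn),
                  reverse=True)
    return selected[:k_final]
```

The surviving samples would then fine-tune the synthesizer, whose new checkpoints feed the next iteration.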

Original language: English
Title of host publication: EMNLP 2025 - 2025 Conference on Empirical Methods in Natural Language Processing, Findings of EMNLP 2025
Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Publisher: Association for Computational Linguistics (ACL)
Pages: 20825-20841
Number of pages: 17
ISBN (electronic): 9798891763357
DOI
Publication status: Published - 2025
Externally published
Event: 30th Conference on Empirical Methods in Natural Language Processing, EMNLP 2025 - Suzhou, China
Duration: 4 Nov 2025 - 9 Nov 2025

Publication series

Name: EMNLP 2025 - 2025 Conference on Empirical Methods in Natural Language Processing, Findings of EMNLP 2025

Conference

Conference: 30th Conference on Empirical Methods in Natural Language Processing, EMNLP 2025
Country/Territory: China
City: Suzhou
Period: 4/11/25 - 9/11/25
