TY - GEN
T1 - Scale Down to Speed Up
T2 - 30th Conference on Empirical Methods in Natural Language Processing, EMNLP 2025
AU - Chen, Zhuoyue
AU - Zhang, Jihai
AU - Liu, Ben
AU - Lin, Fangquan
AU - Yin, Wotao
N1 - Publisher Copyright:
© 2025 Association for Computational Linguistics.
PY - 2025
Y1 - 2025
AB - Optimizing data utilization remains a central challenge in applying Reinforcement Learning (RL) to Large Language Models (LLMs), directly impacting sample efficiency, training stability, and final model performance. Current approaches often rely on massive static datasets, leading to computational inefficiency and redundant gradient updates. In this paper, we propose ScalingRL, a data-centric RL framework that dynamically selects the most informative training samples to optimize RL for mathematical reasoning. Specifically, ScalingRL introduces the Data Effectiveness Score (DES), which quantitatively ranks prompts according to three complementary factors: problem difficulty, Chain-of-Thought complexity, and reward adaptability. ScalingRL then employs an adaptive curriculum scheduler that progressively adjusts the overall scale and specific mix of training prompts, balancing exploration of new, challenging data with exploitation of previously learned concepts, thereby tailoring the data distribution to the model's current learning trajectory and performance. Experimental results demonstrate that ScalingRL achieves performance comparable to full-data training methods while requiring only 1.5K samples instead of 220K, reducing training time from 13 days to just 4 hours on 8×A800 GPUs.
UR - https://www.scopus.com/pages/publications/105028940033
U2 - 10.18653/v1/2025.findings-emnlp.412
DO - 10.18653/v1/2025.findings-emnlp.412
M3 - Conference contribution
AN - SCOPUS:105028940033
T3 - EMNLP 2025 - 2025 Conference on Empirical Methods in Natural Language Processing, Findings of EMNLP 2025
SP - 7806
EP - 7817
BT - EMNLP 2025 - 2025 Conference on Empirical Methods in Natural Language Processing, Findings of EMNLP 2025
A2 - Christodoulopoulos, Christos
A2 - Chakraborty, Tanmoy
A2 - Rosé, Carolyn
A2 - Peng, Violet
PB - Association for Computational Linguistics (ACL)
Y2 - 4 November 2025 through 9 November 2025
ER -