Abstract
Pruning is a widely used technique for compressing large neural networks that eliminates weights whose removal has minimal impact on performance. Current pruning methods, exemplified by magnitude pruning, assign importance scores to weights based on their magnitude and remove those below a certain threshold. However, these methods introduce a gap between the original dense model and the pruned sparse model, potentially impairing performance, especially at high sparsity ratios. To address this issue, we introduce a method that bridges this gap through a low-rank approximation of the difference between the dense and sparse weight matrices. Our approach iteratively refines the sparse weight matrix with a low-rank adjustment, capturing essential information typically lost during pruning. We provide a comprehensive theoretical analysis of our method, establishing its convergence properties and efficacy. Experimental results on LLaMA models validate our method’s effectiveness across various pruning techniques and sparsity levels. At 50% sparsity, it reduces perplexity by 53.9% compared to conventional magnitude pruning on LLaMA-7B. Furthermore, our approach enables an 8.6% reduction in model parameters while maintaining a sparsity ratio of about 50%.
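The core idea described above can be sketched generically: prune a weight matrix by magnitude, then approximate the residual (dense minus sparse) with a truncated SVD and add that low-rank term back. The sketch below is a minimal illustration under assumed choices (random weights, 50% sparsity, a hand-picked rank); it is not the paper's exact iterative algorithm.

```python
import numpy as np

def magnitude_prune(W, sparsity=0.5):
    # Zero out the smallest-magnitude entries until the target sparsity is reached.
    k = int(W.size * sparsity)
    thresh = np.sort(np.abs(W).ravel())[k]
    return W * (np.abs(W) >= thresh)

def low_rank_correction(W_dense, W_sparse, rank=8):
    # Best rank-`rank` approximation (in Frobenius norm) of the pruning
    # residual, obtained from a truncated SVD of (dense - sparse).
    U, S, Vt = np.linalg.svd(W_dense - W_sparse, full_matrices=False)
    return (U[:, :rank] * S[:rank]) @ Vt[:rank, :]

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64))          # stand-in dense weight matrix
W_s = magnitude_prune(W, sparsity=0.5)     # sparse matrix after pruning
L = low_rank_correction(W, W_s, rank=8)    # low-rank adjustment term

# The corrected matrix W_s + L is closer to W than W_s alone.
err_pruned = np.linalg.norm(W - W_s)
err_corrected = np.linalg.norm(W - (W_s + L))
```

By the Eckart–Young theorem, the truncated SVD gives the best rank-`rank` approximation of the residual, so `err_corrected` is guaranteed to be no larger than `err_pruned`; storing the correction as two thin factors keeps the parameter overhead small relative to the dense matrix.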
| Original language | English |
|---|---|
| Pages (from-to) | 54457-54475 |
| Number of pages | 19 |
| Journal | Proceedings of Machine Learning Research |
| Volume | 267 |
| Publication status | Published - 2025 |
| Externally published | Yes |
| Event | 42nd International Conference on Machine Learning, ICML 2025 - Vancouver, Canada |
| Duration | 13 Jul 2025 → 19 Jul 2025 |