Abstract
Finding the optimal size of deep learning models is crucial for energy-efficient training and deployment and has a broad practical impact. Meanwhile, recent studies have reported an unexpected phenomenon, the sparse double descent: as a model's sparsity increases, performance first worsens, then improves, and finally deteriorates. Such non-monotonic behavior raises serious questions about the optimal model size for maintaining high performance: the model needs to be sufficiently over-parametrized, yet too many parameters waste training resources.
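To make the setting concrete, the sketch below shows one common way such a sparsity sweep can be instrumented with global magnitude pruning in PyTorch; the helper name prune_to_sparsity and the example sparsity grid are illustrative assumptions, not the exact pruning protocol used in the paper.

```python
# A minimal sketch, assuming a prune-then-retrain protocol: the sparse double
# descent is typically traced by pruning a trained network to increasing
# sparsity levels, retraining each pruned copy, and plotting test accuracy
# against sparsity.
import copy
import torch
import torch.nn.utils.prune as prune

def prune_to_sparsity(model, sparsity):
    """Return a copy of `model` with `sparsity` fraction of conv/linear
    weights removed by global magnitude pruning."""
    pruned = copy.deepcopy(model)
    params = [(m, "weight") for m in pruned.modules()
              if isinstance(m, (torch.nn.Conv2d, torch.nn.Linear))]
    prune.global_unstructured(params,
                              pruning_method=prune.L1Unstructured,
                              amount=sparsity)
    return pruned  # retrain this copy, then evaluate it on the test set

# Example grid (illustrative): accuracy plotted over these sparsity levels
# after retraining exhibits the worsen-improve-worsen shape described above.
# sparsity_grid = [0.0, 0.5, 0.8, 0.9, 0.95, 0.99]
```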
In this paper, we aim to find this trade-off efficiently. More precisely, we tackle the occurrence of the sparse double descent and present solutions to avoid it. First, we show that a simple \(\ell _2\) regularization method helps to mitigate this phenomenon but sacrifices the performance/sparsity trade-off. To overcome this limitation, we then introduce a learning scheme in which knowledge distillation regularizes the student model. Supported by experimental results on typical image classification setups, we show that this approach avoids the sparse double descent.
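As a rough illustration of how knowledge distillation can act as a regularizer on the (pruned) student, the sketch below combines cross-entropy on hard labels with a KL term towards the teacher's softened logits, in the style of Hinton et al.; the temperature and alpha values are placeholder choices, and the exact loss used in the paper may differ.

```python
# A minimal sketch of a distillation-regularized training loss (assumed
# Hinton-style soft targets, not necessarily the paper's exact formulation).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets,
                      temperature=4.0, alpha=0.5):
    """Cross-entropy on hard labels plus KL divergence towards the teacher's
    softened predictions, which regularizes the student during retraining."""
    ce = F.cross_entropy(student_logits, targets)
    kd = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=1),
        F.softmax(teacher_logits / temperature, dim=1),
        reduction="batchmean",
    ) * (temperature ** 2)
    return alpha * ce + (1.0 - alpha) * kd
```

For comparison, the \(\ell _2\) baseline discussed above corresponds to plain weight decay, e.g. the standard weight_decay argument of PyTorch optimizers.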