The Quest of Finding the Antidote to Sparse Double Descent

  • Conference paper
Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD 2023)

Abstract

In energy-efficient schemes, finding the optimal size of deep learning models is very important and has a broad impact. Meanwhile, recent studies have reported an unexpected phenomenon, the sparse double descent: as the model’s sparsity increases, performance first worsens, then improves, and finally deteriorates again. Such non-monotonic behavior raises serious questions about the optimal model size needed to maintain high performance: the model has to be sufficiently over-parametrized, but having too many parameters wastes training resources.
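
Observing this curve in practice amounts to sweeping a trained model over increasing sparsity levels and recording test accuracy at each point. Below is a minimal sketch of such a sweep using global magnitude pruning in PyTorch; it is not the paper's exact protocol, and dense_model, train_one_epoch, and evaluate are placeholders for the reader's own training setup.

    # Sketch: trace test accuracy as sparsity grows, to expose a possible
    # non-monotonic (sparse double descent) curve. Assumes a trained dense
    # model and user-supplied train_one_epoch / evaluate helpers.
    import copy
    import torch
    import torch.nn.utils.prune as prune

    def sparsity_sweep(dense_model, train_one_epoch, evaluate,
                       levels=(0.0, 0.5, 0.8, 0.9, 0.95, 0.99),
                       finetune_epochs=20):
        curve = []
        for s in levels:
            model = copy.deepcopy(dense_model)
            if s > 0.0:
                params = [(m, "weight") for m in model.modules()
                          if isinstance(m, (torch.nn.Conv2d, torch.nn.Linear))]
                prune.global_unstructured(params,
                                          pruning_method=prune.L1Unstructured,
                                          amount=s)
            for _ in range(finetune_epochs):    # re-train the surviving weights
                train_one_epoch(model)
            curve.append((s, evaluate(model)))  # test accuracy vs. sparsity
        return curve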

In this paper, we aim to find the best trade-off efficiently. More precisely, we tackle the occurrence of the sparse double descent and present some solutions to avoid it. First, we show that a simple \(\ell_2\) regularization method can help to mitigate this phenomenon, but at the cost of the performance/sparsity trade-off. To overcome this problem, we then introduce a learning scheme in which knowledge distillation regularizes the student model. Supported by experimental results obtained on typical image classification setups, we show that this approach avoids the phenomenon.
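
As a concrete illustration of regularization by distillation, a common formulation combines a cross-entropy term with a temperature-softened Kullback-Leibler term against the teacher, while the \(\ell_2\) penalty enters through weight decay. The sketch below assumes this standard setup; the temperature and weighting values are illustrative, not the paper's hyper-parameters.

    # Sketch of a distillation-regularized student loss (illustrative values).
    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, targets, T=4.0, alpha=0.5):
        ce = F.cross_entropy(student_logits, targets)            # hard-label term
        kd = F.kl_div(F.log_softmax(student_logits / T, dim=1),  # match the teacher's
                      F.softmax(teacher_logits / T, dim=1),      # softened outputs
                      reduction="batchmean") * (T * T)
        return (1 - alpha) * ce + alpha * kd

    # The l2 regularization discussed above corresponds to the optimizer's
    # weight_decay term, e.g.:
    # optimizer = torch.optim.SGD(student.parameters(), lr=0.1,
    #                             momentum=0.9, weight_decay=1e-4)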

Author information

Correspondence to Victor Quétu.

Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Quétu, V., Milovanović, M. (2025). The Quest of Finding the Antidote to Sparse Double Descent. In: Meo, R., Silvestri, F. (eds) Machine Learning and Principles and Practice of Knowledge Discovery in Databases. ECML PKDD 2023. Communications in Computer and Information Science, vol 2136. Springer, Cham. https://doi.org/10.1007/978-3-031-74640-6_12

  • DOI: https://doi.org/10.1007/978-3-031-74640-6_12

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-74639-0

  • Online ISBN: 978-3-031-74640-6
