Abstract
Finding the optimal size of deep learning models is crucial for energy-efficient training and deployment and has a broad practical impact. Meanwhile, recent studies have reported an unexpected phenomenon, the sparse double descent: as a model's sparsity increases, performance first worsens, then improves, and finally deteriorates. Such non-monotonic behavior raises serious questions about the optimal model size for maintaining high performance: the model needs to be sufficiently over-parametrized, yet too many parameters waste training resources.
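To make the setting concrete, the sketch below shows one common way such a sparsity sweep can be instrumented with global magnitude pruning in PyTorch; the helper name prune_to_sparsity and the example sparsity grid are illustrative assumptions, not the exact pruning protocol used in the paper.

```python
# A minimal sketch, assuming a prune-then-retrain protocol: the sparse double
# descent is typically traced by pruning a trained network to increasing
# sparsity levels, retraining each pruned copy, and plotting test accuracy
# against sparsity.
import copy
import torch
import torch.nn.utils.prune as prune

def prune_to_sparsity(model, sparsity):
    """Return a copy of `model` with `sparsity` fraction of conv/linear
    weights removed by global magnitude pruning."""
    pruned = copy.deepcopy(model)
    params = [(m, "weight") for m in pruned.modules()
              if isinstance(m, (torch.nn.Conv2d, torch.nn.Linear))]
    prune.global_unstructured(params,
                              pruning_method=prune.L1Unstructured,
                              amount=sparsity)
    return pruned  # retrain this copy, then evaluate it on the test set

# Example grid (illustrative): accuracy plotted over these sparsity levels
# after retraining exhibits the worsen-improve-worsen shape described above.
# sparsity_grid = [0.0, 0.5, 0.8, 0.9, 0.95, 0.99]
```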
In this paper, we aim to find this trade-off efficiently. More precisely, we tackle the occurrence of the sparse double descent and present solutions to avoid it. First, we show that a simple \(\ell _2\) regularization method helps to mitigate this phenomenon but sacrifices the performance/sparsity trade-off. To overcome this limitation, we then introduce a learning scheme in which knowledge distillation regularizes the student model. Supported by experimental results on typical image classification setups, we show that this approach avoids the sparse double descent.
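As a rough illustration of how knowledge distillation can act as a regularizer on the (pruned) student, the sketch below combines cross-entropy on hard labels with a KL term towards the teacher's softened logits, in the style of Hinton et al.; the temperature and alpha values are placeholder choices, and the exact loss used in the paper may differ.

```python
# A minimal sketch of a distillation-regularized training loss (assumed
# Hinton-style soft targets, not necessarily the paper's exact formulation).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets,
                      temperature=4.0, alpha=0.5):
    """Cross-entropy on hard labels plus KL divergence towards the teacher's
    softened predictions, which regularizes the student during retraining."""
    ce = F.cross_entropy(student_logits, targets)
    kd = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=1),
        F.softmax(teacher_logits / temperature, dim=1),
        reduction="batchmean",
    ) * (temperature ** 2)
    return alpha * ce + (1.0 - alpha) * kd
```

For comparison, the \(\ell _2\) baseline discussed above corresponds to plain weight decay, e.g. the standard weight_decay argument of PyTorch optimizers.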