
A survey of regularization strategies for deep models


Abstract

A central concern in machine learning is how to build algorithms that perform well not only on the training data but also on new data. The no-free-lunch theorem implies that no single learning algorithm is best for every task, so each specific task calls for a tailored algorithm. In practice, a set of strategies and preferences is built into learning machines to tune them to the problem at hand; these strategies and preferences, whose core aim is to improve generalization, are collectively known as regularization. Because deep models have a very large number of parameters, a wide range of regularization methods is available to the deep learning community, and developing more effective strategies has been the subject of significant research in recent years. However, it is difficult for practitioners to choose the most suitable strategy for a given problem, because there has been no comparative study of how the different strategies perform. In this paper, the most effective regularization methods and their variants are first presented and analyzed in a systematic way. A comparative study is then presented in which test error and computational cost are evaluated for a convolutional neural network trained on the CIFAR-10 dataset (https://www.cs.toronto.edu/~kriz/cifar.html). Finally, the regularization methods are compared in terms of network accuracy, the number of epochs required for training, and the number of operations per input sample, and the results are discussed and interpreted in light of each strategy. The experiments show that weight decay and data augmentation incur little computational overhead and can therefore be used in most applications. When sufficient computational resources are available, the Dropout family of methods is a reasonable choice; with abundant resources, the batch normalization family and ensemble methods also become practical options.
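To illustrate how the low-overhead strategies recommended above are typically combined in practice, the sketch below trains a small convolutional network on CIFAR-10 with data augmentation, batch normalization, dropout, and weight decay. It assumes PyTorch and torchvision; the architecture, learning rate, dropout rates, and weight-decay coefficient are illustrative assumptions, not the configuration used in the paper's experiments.

import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Data augmentation: random crops and horizontal flips on the training images.
train_tf = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])
train_set = datasets.CIFAR10(root="./data", train=True, download=True, transform=train_tf)
train_loader = DataLoader(train_set, batch_size=128, shuffle=True)

# A small CNN with batch normalization and dropout as built-in regularizers.
model = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
    nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Dropout(p=0.25),
    nn.Flatten(),
    nn.Linear(64 * 16 * 16, 256), nn.ReLU(),
    nn.Dropout(p=0.5),
    nn.Linear(256, 10),
)

# Weight decay (L2 regularization) is applied through the optimizer.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=5e-4)
criterion = nn.CrossEntropyLoss()

# One training pass; in practice this loop is repeated for several epochs
# and test error is tracked on the held-out CIFAR-10 test split.
model.train()
for images, labels in train_loader:
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()

Note that dropout and batch normalization add per-layer cost to every forward pass, whereas weight decay and the augmentation pipeline add essentially none; this is the trade-off the comparison above quantifies.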


Notes

  1. A learning algorithm that compares new problem instances with instances in the training set.

  2. A model in which a graph expresses the conditional dependence structure between random variables.

  3. Natural Language Processing.

  4. Part-of-speech tagging.

  5. Named entity recognition.

  6. Semantic role labeling.

  7. Dense Convolutional Network.

  8. Long Short-Term Memory.


Author information


Corresponding author

Correspondence to Reza Berangi.


Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Moradi, R., Berangi, R. & Minaei, B. A survey of regularization strategies for deep models. Artif Intell Rev 53, 3947–3986 (2020). https://doi.org/10.1007/s10462-019-09784-7

