Abstract
The most critical concern in machine learning is how to make an algorithm perform well on both the training data and new data. The no free lunch theorem implies that each specific task needs its own tailored machine learning algorithm. Learning machines are therefore built with a set of strategies and preferences that tune them for the problem at hand. These strategies and preferences, whose core concern is improving generalization, are collectively known as regularization. In deep learning, because of the large number of parameters, a great many regularization methods are available to the community, and developing more effective regularization strategies has been the subject of significant research effort in recent years. However, it is difficult for practitioners to choose the most suitable strategy for the problem at hand, because there is no comparative study of the performance of the different strategies. In this paper, as a first step, the most effective regularization methods and their variants are presented and analyzed in a systematic way. As a second step, a comparative study of regularization techniques is presented in which test errors and computational costs are evaluated on a convolutional neural network using the CIFAR-10 dataset (https://www.cs.toronto.edu/~kriz/cifar.html). Finally, the regularization methods are compared in terms of network accuracy, the number of epochs needed to train the network, and the number of operations per input sample, and the results are discussed and interpreted in light of the strategy each method employs. The experimental results show that weight decay and data augmentation incur little computational overhead and can therefore be used in most applications. When sufficient computational resources are available, Dropout-family methods are a rational choice; when computational resources are abundant, batch-normalization-family and ensemble methods are reasonable strategies as well.
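To make the comparison concrete, the following is a minimal sketch, assuming PyTorch and torchvision (which the paper itself does not prescribe), of how the regularizers discussed above (weight decay, data augmentation, dropout, and batch normalization) are typically combined in a small CIFAR-10 convolutional network. The architecture and hyperparameters are illustrative placeholders, not the experimental configuration used in the paper.

# Illustrative sketch only: combining four common regularizers on CIFAR-10.
import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as transforms

# Data augmentation: random crops and horizontal flips of the 32x32 images.
train_tf = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])
train_set = torchvision.datasets.CIFAR10(root="./data", train=True,
                                         download=True, transform=train_tf)
loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True)

model = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1),
    nn.BatchNorm2d(32),          # batch normalization after each conv layer
    nn.ReLU(),
    nn.MaxPool2d(2),             # 32x32 -> 16x16
    nn.Conv2d(32, 64, 3, padding=1),
    nn.BatchNorm2d(64),
    nn.ReLU(),
    nn.MaxPool2d(2),             # 16x16 -> 8x8
    nn.Flatten(),
    nn.Dropout(p=0.5),           # dropout before the classifier
    nn.Linear(64 * 8 * 8, 10),
)

# Weight decay: an L2 penalty on the weights, applied through the optimizer.
opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9,
                      weight_decay=5e-4)
loss_fn = nn.CrossEntropyLoss()

for images, labels in loader:    # one pass over the training set
    opt.zero_grad()
    loss = loss_fn(model(images), labels)
    loss.backward()
    opt.step()

Note that weight decay is implemented here through the optimizer's weight_decay argument, which for SGD coincides with an L2 penalty on the weights; the augmentation and dropout settings shown are conventional defaults, not tuned values.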
Notes
A learning algorithm that compares new problem instances with instances in the training set.
A model in which a graph expresses the conditional dependence structure between random variables.
Natural Language Processing.
Part-of-speech tagging.
Named entity recognition.
Semantic-role labeling.
Dense Convolutional Network.
Long Short-Term Memory.
Cite this article
Moradi, R., Berangi, R. & Minaei, B. A survey of regularization strategies for deep models. Artif Intell Rev 53, 3947–3986 (2020). https://doi.org/10.1007/s10462-019-09784-7