Neurocomputing
Volume 357, 10 September 2019, Pages 177-187

Rademacher dropout: An adaptive dropout for deep neural network via optimizing generalization gap

https://doi.org/10.1016/j.neucom.2019.05.008

Abstract

Dropout plays an important role in improving the generalization ability of deep learning models. However, the empirical and fixed choice of dropout rates in traditional dropout strategies may increase the generalization gap, which runs counter to one of the principal aims of dropout. To handle this problem, we propose a novel dropout method in this paper. Through a theoretical analysis of Dropout Rademacher Complexity, we first prove that the generalization gap of a deep model is bounded by a constraint function related to the dropout rates. We then derive a closed-form solution by optimizing this constraint function, which serves as a distribution estimation of the dropout rates. Based on the closed-form solution, an algorithm with lightweight computational complexity, called Rademacher Dropout (RadDropout), is presented to achieve the adaptive adjustment of dropout rates. Moreover, extensive experimental results on benchmark datasets verify the effectiveness of our proposed method, showing that RadDropout improves both convergence rate and prediction accuracy.

Introduction

Deep learning has achieved great success in a number of domains, such as image processing [1], [2], text analysis [3], and control [4]. However, excessive feature learning by deep neural networks (DNNs) might lead to overfitting, which reduces the generalization ability of deep models. Motivated by Kerkar et al. [5], [6], the generalization ability of DNNs can be quantitatively measured by the generalization gap, i.e., the difference between the empirical risk (training error) and the expected risk (generalization error). Moreover, the generalization gap of DNNs [7] can be formulated as
$$\text{generalization gap} = \left| R_{\text{exp}} - R_{\text{emp}} \right|,$$
where $R_{\text{exp}}$ and $R_{\text{emp}}$ represent the expected risk and the empirical risk of deep models, respectively. Once a deep model overfits, $R_{\text{exp}}$ will be much larger than $R_{\text{emp}}$, and its generalization gap can be written as
$$\text{generalization gap} = R_{\text{exp}} - R_{\text{emp}}.$$
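As a concrete (hypothetical) illustration of this definition, the following sketch estimates the generalization gap by using held-out error as a proxy for the expected risk; the prediction function, loss, and data variables are placeholders for illustration, not quantities taken from the paper.

```python
# A hypothetical sketch (not from the paper): the generalization gap as the
# difference between the empirical risk R_emp (training error) and an
# estimate of the expected risk R_exp (held-out / test error).
import numpy as np

def risk(predict_fn, X, y, loss_fn):
    """Average loss of predict_fn over a dataset."""
    return float(np.mean([loss_fn(predict_fn(x), t) for x, t in zip(X, y)]))

def generalization_gap(predict_fn, train_set, test_set, loss_fn):
    R_emp = risk(predict_fn, *train_set, loss_fn)   # empirical risk
    R_exp = risk(predict_fn, *test_set, loss_fn)    # expected-risk estimate
    return abs(R_exp - R_emp)
```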

To enhance the generalization ability of DNNs, various techniques have been developed, such as ensemble learning [8], batch normalization (BN) [9], and dropout [10], [11]. Ensemble learning prevents overfitting via a combination of multiple classifiers [12]. BN was proposed to reduce the internal covariate shift through normalization operations on the batches of each layer. However, ensemble learning requires expensive computational resources, and the performance of BN depends on the batch size [9]. To improve the generalization ability of DNNs stably and efficiently, in this paper we focus on dropout for its simplicity and remarkable effectiveness. Dropout is one of the most widely adopted regularization approaches in deep learning [13], [14], [15], [16]. The strategy of dropout is to randomly drop neurons within one layer while training DNNs [10]. The generalization ability of a deep model is then improved by breaking fixed combinations of features.
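As a point of reference for the adaptive variants discussed below, here is a minimal numpy sketch of standard (inverted) dropout with a single fixed, hand-chosen rate p, in the spirit of Srivastava et al. [10]; function and variable names are illustrative, not taken from the paper.

```python
# A minimal sketch of standard (inverted) dropout with a fixed rate p.
import numpy as np

def dropout_forward(h, p=0.5, training=True, rng=np.random.default_rng()):
    """Randomly drop each neuron of the activation array `h` with probability p.
    Surviving activations are rescaled by 1/(1-p) so the expected output
    matches test-time behavior."""
    if not training or p == 0.0:
        return h
    mask = rng.random(h.shape) >= p          # Bernoulli keep mask
    return h * mask / (1.0 - p)              # inverted-dropout scaling
```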

Despite its empirical success, the current theoretical analysis of the dropout technique remains rudimentary and vague [17]. This can be explained by the fact that the fundamental theory of DNNs itself remains a riddle, often referred to as the ‘black box’ problem [18]. Therefore, the effectiveness of the dropout mechanism can only be estimated with existing tools, such as Bayesian theory [19], optimization analysis [20], and statistical generalization bounds [21], [22]. For linear models, dropout training was originally analyzed as ensemble learning in shallow networks [11]. Moreover, Warde-Farley et al. [16] verified the effectiveness of the geometric average approximation, which combines the training results from multiple sub-networks. Motivated by norm regularization theory, Wager et al. [23] showed that, for generalized linear models, dropout is first-order equivalent to an ℓ2 regularizer. For deep models, Gal et al. [19], [24] proposed probabilistic interpretations of dropout training, proving that DNNs trained with dropout are mathematically equivalent to an approximation of a well-known Bayesian model. Recently, a battery of studies has emerged that attempts to explain dropout training through risk bound analysis and Rademacher Complexity [25]. Unlike data-independent complexity measures, Rademacher Complexity attains a much more compact generalization representation [26]. For DNNs trained with dropout, Mou et al. [17] showed that the generalization error is bounded by the sum of two offset Rademacher Complexities. Gao et al. [21] extended Rademacher Complexity to Dropout Rademacher Complexity and obtained a compact estimation of the expected risk.

Although the theoretical analysis is still vague, empirical experiments show that the effect of dropout is intrinsically related to the choice of dropout rates [20]. For convenience, the traditional dropout method assumes that the dropout rates obey a Bernoulli distribution. Based on this viewpoint, traditional dropout sets the value of the dropout rates empirically, by some rule of thumb [10], [11]. Meanwhile, the traditional dropout method treats every neuron in a layer equally, so that all neurons share the same dropout rate. However, different neurons represent different features and contribute to the prediction to different extents [20], [27]. Combining these insights, there is still room to improve the traditional dropout method by adaptively choosing dropout rates for DNNs.

To improve the generalization performance of DNNs, a number of dropout variants have been proposed that design adaptive mechanisms for updating the dropout rates. From a probabilistic viewpoint, these variants mainly concentrate on modeling the distribution of dropout rates. Some researchers assume that the dropout rates obey a specific prior distribution [10], [20], [27]. For example, Ba et al. [27] assumed that dropout rates obey a Bernoulli distribution and constructed a binary belief network over the DNN, which generates the dropout rates adaptively by minimizing an energy function. However, the additional binary belief network results in more computational overhead as the model size increases. Moreover, Li et al. [20] sampled dropout rates from a prior multinomial distribution and proposed evolutional dropout via risk bound analysis for the optimization of a Stochastic Gradient Descent (SGD) learning algorithm [28]. Aside from assuming a prior distribution, another category of dropout variants attempts to estimate the distribution of dropout rates within some optimization framework [17], [19], [22], [24], [29], [30]. Based on the Bayesian optimization framework, a variety of dropout methods were proposed [19], [24], [30], [31]. Building on deep Bayesian learning, Gal et al. [29] proposed “concrete dropout” to obtain improved performance and better-calibrated uncertainties. Through variational Bayesian inference, Kingma et al. [30] explored an extended version of Gaussian dropout called “variational dropout” with local re-parameterization. Recently, an increasing number of studies have tried to estimate the distribution of dropout rates via risk bound optimization [17], [22]. Through Rademacher Complexity, Zhai et al. [22] proposed an adaptive dropout regularizer on the objective function. It is worth noting that our research is fundamentally different from their work: they take a Rademacher Complexity term as a regularizer on the objective function of DNNs, whereas we utilize Rademacher Complexity to estimate and optimize the generalization gap.

In this paper, we propose a novel method to achieve the adaptive adjustment of dropout rates with low computational complexity. In fact, estimating the distribution of dropout rates directly is a challenging task, akin to a grid search over a large number of hyperparameters [22]. As an alternative, we estimate the distribution of dropout rates by optimizing the generalization gap. Inspired by the estimation of Gao et al. [21], we first prove that the generalization gap of DNNs trained with dropout is bounded by a constraint function related to the dropout rates. Subsequently, to optimize the generalization gap, we minimize the constraint function through a theoretical derivation. As a result, we obtain a closed-form solution in the optimization process, which represents a distribution estimation of the dropout rates. This solution provides an efficient and concise way to calculate the dropout rates from the batch inputs of the dropout layer. Finally, we propose an adaptive algorithm called Rademacher Dropout (RadDropout) based on the closed-form solution. The algorithm generates the dropout rates adaptively during feed-forward propagation and requires only lightweight computation. To further justify our method, we conduct experiments on five benchmark datasets: MNIST, SVHN, NORB, Cifar-100, and TinyImagenet, comparing RadDropout with several traditional and state-of-the-art dropout approaches. The experimental results illustrate that RadDropout improves both convergence rate and prediction accuracy.
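To make the idea of batch-dependent, per-neuron dropout rates concrete, the following hypothetical numpy sketch shows an adaptive dropout layer whose keep probabilities are recomputed from the current batch inputs during the forward pass. The rate rule used here (keep probabilities proportional to per-neuron activation norms) is a placeholder for illustration only; it is not the closed-form solution derived in this paper.

```python
# A hypothetical adaptive dropout layer: per-neuron keep probabilities are
# recomputed from the batch inputs. The norm-based rule is illustrative and
# is NOT the paper's closed-form solution.
import numpy as np

def adaptive_dropout_forward(H, p_mean=0.5, eps=1e-8,
                             rng=np.random.default_rng()):
    """H: (batch_size, n_neurons) inputs to the dropout layer.
    Neurons with larger batch activation norms get higher keep probabilities;
    probabilities are scaled so their mean is roughly 1 - p_mean."""
    norms = np.linalg.norm(H, axis=0) + eps              # per-neuron batch norm
    keep = norms / norms.sum() * H.shape[1] * (1.0 - p_mean)
    keep = np.clip(keep, 0.05, 1.0)                      # avoid degenerate rates
    mask = rng.random(H.shape) < keep                    # per-neuron Bernoulli mask
    return H * mask / keep                               # rescale per neuron
```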

The main contributions of this paper are as follows:

  • We first prove that the generalization gap of DNNs trained with dropout is bounded by a constraint function related to dropout rates.

  • We optimize the generalization gap by a theoretical derivation and obtain a closed-form solution as a distribution estimation of dropout rates.

  • We propose RadDropout, a novel dropout algorithm based on our solution, which achieves adaptive adjustment of dropout rates with lightweight computational complexity.

The remainder of this paper is organized as follows. In Section 2, we present some preliminaries. In Section 3, we detail the theoretical derivations and the proposed adaptive algorithm, RadDropout. We present the experimental results on five datasets in Section 4 and draw conclusions in Section 5.

Section snippets

Expression of DNN

Here, we describe the structure of a fully connected network:

  • The deep network is composed of $L$ hidden layers $\{h^1, h^2, \ldots, h^L\}$, and layer $h^i$ has $N_i$ neurons ($N_{L+1} = 1$).

  • The weights between layer $h^i$ and layer $h^{i+1}$ are $W^i$, where $W^i \in \mathbb{R}^{N_i \times N_{i+1}}$ ($1 \le i \le L$). Moreover, $W^i = \{W^i_1, W^i_2, \ldots, W^i_{N_{i+1}}\}$, where $W^i_j \in \mathbb{R}^{N_i \times 1}$ ($1 \le j \le N_{i+1}$).

  • The dropout operation can be considered as an element-wise product of the input vector and the bitmask vector, which is generated by the dropout rates (see the sketch following this list). Following Warde-Farley et al.
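Under the notation above, the following minimal numpy sketch shows a fully connected forward pass in which dropout is applied as an element-wise product of each layer's input with a bitmask vector drawn from the dropout rates. The layer sizes, ReLU nonlinearity, and random weights are assumptions made for illustration, not values from the paper.

```python
# A minimal sketch of a fully connected forward pass with dropout applied
# as an element-wise bitmask product; W^i has shape (N_i, N_{i+1}).
import numpy as np

rng = np.random.default_rng(0)
layer_sizes = [784, 256, 128, 1]                 # N_1, ..., N_{L+1} (N_{L+1} = 1)
weights = [rng.standard_normal((n_in, n_out)) * 0.01
           for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

def forward(x, weights, dropout_rates, training=True):
    """x: (batch, N_1) inputs; dropout_rates[i] is the drop probability
    applied to the input of layer i."""
    h = x
    for i, (W, p) in enumerate(zip(weights, dropout_rates)):
        if training and p > 0.0:
            bitmask = rng.random(h.shape) >= p   # element-wise bitmask vector
            h = h * bitmask / (1.0 - p)          # inverted-dropout rescaling
        h = h @ W
        if i < len(weights) - 1:                 # ReLU on hidden layers only
            h = np.maximum(h, 0.0)
    return h

y = forward(rng.standard_normal((32, 784)), weights, dropout_rates=[0.2, 0.5, 0.5])
```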

Rademacher dropout: optimization of generalization gap

Before further analyzing the generalization gap, we state a supposition as a prior. For the convenience of derivation, we suppose that only one hidden layer $h^s$ has a dropout operation. Since the expression of the DNN is recursive, the case of a dropout operation on one hidden layer can easily be generalized to multi-layer dropout.

Experiment

In this section, we first perform experiments to provide a primary validation of the effectiveness of the proposed algorithm, RadDropout, on three benchmark datasets in the image recognition area: MNIST, NORB, and SVHN. Then, we further verify the proposed RadDropout on two comparatively large and complex datasets: Cifar-100 and TinyImagenet.

Conclusions

In this paper, we propose a novel dropout method to achieve the adaptive adjustment of dropout rates. Based on Dropout Rademacher Complexity, we first prove that the generalization gap is bounded by a constraint function related to the dropout rates. By a theoretical derivation, we minimize the constraint function and derive a closed-form solution as a distribution estimation of the dropout rates. As a result, we propose an adaptive dropout algorithm called Rademacher Dropout based on the closed-form solution.

Conflict of Interest

The authors declare that there is no conflict of interest regarding the publication of this article.

Acknowledgement

This work was supported by the National Science Foundation of China (Grant Nos. 91648204 and 11701566).


References (48)

  • N. Srivastava et al.

    Dropout: a simple way to prevent neural networks from overfitting

    J. Mach. Learn. Res.

    (2014)
  • B.E. Rosen

    Ensemble learning using decorrelated neural networks

    Connect. Sci.

    (1996)
  • V. Pham et al.

    Dropout improves recurrent neural networks for handwriting recognition

    Proceedings of the 2014 14th International Conference on Frontiers in Handwriting Recognition (ICFHR)

    (2014)
  • S. Wager et al.

    Altitude training: Strong bounds for single-layer dropout

    Proceedings of the Advances in Neural Information Processing Systems

    (2014)
  • Q. Qian et al.

    Distance metric learning using dropout: a structured regularization approach

    Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

    (2014)
  • D. Warde-Farley et al.

    An empirical analysis of dropout in piecewise linear networks

    The International Conference on Learning Representations (ICLR)

    (2017)
  • W. Mou et al.

    Dropout training, data-dependent regularization, and generalization bounds

    Proceedings of the International Conference on Machine Learning

    (2018)
  • R. Shwartz-Ziv, N. Tishby, Opening the black box of deep neural networks via information, arXiv:1703.00810...
  • Y. Gal, Z. Ghahramani, Dropout as a Bayesian approximation, arXiv:1506.02157...
  • Z. Li et al.

    Improved dropout for shallow and deep learning

    Proceedings of the Advances in Neural Information Processing Systems

    (2016)
  • W. Gao et al.

    Dropout Rademacher complexity of deep neural networks

    Sci. China Inf. Sci.

    (2016)
  • K. Zhai, H. Wang, Adaptive dropout with Rademacher complexity regularization...
  • S. Wager et al.

    Dropout training as adaptive regularization

    Proceedings of the Advances in Neural Information Processing Systems

    (2013)
  • Y. Gal et al.

    Dropout as a Bayesian approximation: Representing model uncertainty in deep learning

    Proceedings of the International Conference on Machine Learning

    (2016)

    Haotian Wang was born in Anhui, China, in 1995. He received the B.E. degree in computer science and technology from National University of Defense Technology, Changsha, China, in 2017. He is currently pursuing the M.E. degree in computer science and technology with the National University of Defense Technology, Changsha, China. His research interests include machine learning and computer vision.

    Wenjing Yang was born in Changsha, China. She received the Ph.D. degree in multi-scale modelling from Manchester University. She is currently an associate research fellow in the State Key Laboratory of High Performance Computing, National University of Defense Technology. Her research interests include machine learning, robotics software, and high-performance computing.

    Zhenyu Zhao received the B.S. degree in mathematics from the University of Science and Technology of China in 2009, and the Ph.D. degree in applied mathematics from the National University of Defense Technology in 2016. He is currently a lecturer at the College of Liberal Arts and Sciences, National University of Defense Technology. His research interests include computer vision, pattern recognition, and machine learning.

    Tingjin Luo received the Ph.D. degree from the College of Science, National University of Defense Technology, Changsha, China, in 2018. He received the B.S. and M.S. degrees from the College of Information System and Management, National University of Defense Technology, Changsha, China, in 2011 and 2013, respectively. He was a visiting Ph.D. student at the University of Michigan, Ann Arbor, MI, USA, from 2015 to 2017. He is currently an Assistant Professor with the College of Science, National University of Defense Technology. He has authored several papers in journals and conferences, such as IEEE TKDE, IEEE TCYB, IEEE TIP, Scientific Reports, and KDD 2017. His current research interests include machine learning, multimedia analysis, optimization, and computer vision.

    Ji Wang is a professor in the State Key Laboratory of High Performance Computing, School of Computer, National University of Defense Technology. He received his PhD in Computer Science from National University of Defense Technology. His research interests include programming methodology, formal methods and software engineering for smart systems.

    Yuhua Tang received the BS and MS degrees from the Department of Computer Science, National University of Defense Technology, China, in 1983 and 1986, respectively. She is currently a professor in the State Key Laboratory of High Performance Computing, National University of Defense Technology. Her research interests include supercomputer architecture and core router’s design.
