Rademacher dropout: An adaptive dropout for deep neural network via optimizing generalization gap
Introduction
Deep learning has achieved great success in a number of domains, such as image processing [1], [2], text analysis [3], and control [4]. However, excessive feature learning by deep neural networks (DNNs) may lead to overfitting, which reduces the generalization ability of deep models. Following Keskar et al. [5], [6], the generalization ability of DNNs can be quantitatively measured by the generalization gap, i.e., the difference between the empirical risk (training error) and the expected risk (generalization error). Formally, the generalization gap of DNNs [7] can be written as Gap = Rexp − Remp, where Rexp and Remp represent the expected risk and the empirical risk of a deep model, respectively. Once a deep model overfits, Rexp becomes much larger than Remp and its generalization gap satisfies Gap = Rexp − Remp ≫ 0.
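To make this definition concrete, the gap can be estimated numerically by approximating the expected risk with a held-out test error. The sketch below, with made-up error lists, is purely illustrative and not part of the paper's method:

```python
import numpy as np

def generalization_gap(train_errors, test_errors):
    """Estimate the generalization gap Rexp - Remp, approximating the
    expected risk by the held-out test error and the empirical risk
    by the training error (per-sample 0/1 losses)."""
    r_emp = float(np.mean(train_errors))  # empirical risk (training error)
    r_exp = float(np.mean(test_errors))   # expected risk (test-set estimate)
    return r_exp - r_emp

# An overfitting model: near-zero training error, large test error.
gap = generalization_gap(train_errors=[0, 0, 0, 1], test_errors=[1, 0, 1, 1])
print(gap)  # 0.5
```

A large positive gap flags overfitting; a well-regularized model keeps the gap close to zero.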
To enhance the generalization ability of DNNs, various techniques have been developed, such as ensemble learning [8], batch normalization (BN) [9], and dropout [10], [11]. Ensemble learning prevents overfitting by combining multiple classifiers [12]. BN was proposed to reduce internal covariate shift through normalization operations on the mini-batches at each layer. However, ensemble learning requires expensive computational resources, and the performance of BN depends on the batch size [9]. To improve the generalization ability of DNNs stably and efficiently, in this paper we focus on dropout for its simplicity and remarkable effectiveness. Dropout is one of the most widely adopted regularization approaches in deep learning [13], [14], [15], [16]. The strategy of dropout is to randomly drop neurons within a layer while training DNNs [10]. The generalization ability of a deep model is then improved by breaking fixed combinations of features.
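As a point of reference for the dropout strategy described above, a minimal (inverted) dropout layer can be sketched as follows; the function name and the 1/(1−p) rescaling convention are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(x, p):
    """Standard (inverted) dropout: each neuron is dropped independently
    with probability p; the survivors are rescaled by 1/(1-p) so that
    the expected activation is unchanged at training time."""
    mask = rng.random(x.shape) >= p        # Bernoulli keep-mask
    return x * mask / (1.0 - p)

h = np.ones(8)
print(dropout(h, p=0.5))  # roughly half the entries zeroed, the rest scaled to 2.0
```

Note that every neuron in the layer shares the same scalar rate p, which is exactly the restriction the paper later relaxes.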
Despite its empirical success, the theoretical analysis of the dropout technique remains rudimentary and vague [17]. This can be explained by the fact that the fundamental theory of DNNs itself remains a riddle, often called the ‘Black Box’ [18]. Therefore, the effectiveness of the dropout mechanism can only be estimated with existing tools, such as Bayesian theory [19], optimization analysis [20], and statistical generalization bounds [21], [22]. For linear models, dropout training was originally analyzed as ensemble learning in shallow networks [11]. Moreover, Warde-Farley et al. [16] verified the effectiveness of the geometric average approximation, which combines the training results of multiple sub-networks. Motivated by norm regularization theory, Wager et al. [23] showed that, for generalized linear models, dropout is first-order equivalent to an ℓ2 regularizer. For deep models, Gal et al. [19], [24] proposed probabilistic interpretations of dropout training, proving that DNNs trained with dropout are mathematically equivalent to an approximation of a well-known Bayesian model. Recently, a battery of studies has emerged that attempts to explain dropout training through risk bound analysis and Rademacher Complexity [25]. Unlike data-independent complexity measures, Rademacher Complexity can attain a much more compact generalization representation [26]. For DNNs trained with dropout, Mou et al. [17] derived that the generalization error is bounded by the sum of two offset Rademacher Complexities. Gao et al. [21] developed Rademacher Complexity into Dropout Rademacher Complexity and obtained a compact estimation of the expected risk.
Although the theoretical analysis is still vague, empirical experiments show that the effect of dropout is intrinsically related to the choice of dropout rates [20]. For convenience, the traditional dropout method assumes that the dropout mask obeys a Bernoulli distribution. Based on this viewpoint, traditional dropout sets the dropout rates empirically, by some rule of thumb [10], [11]. Meanwhile, traditional dropout treats every neuron in a layer equally, so all neurons share the same dropout rate. However, different neurons represent different features and contribute to the prediction to different extents [20], [27]. Combining these insights, there is still room to improve the traditional dropout method by adaptively choosing dropout rates for DNNs.
To improve the generalization performance of DNNs, a number of dropout variants have been proposed that design adaptive mechanisms for updating dropout rates. From the probabilistic viewpoint, these variants mainly concentrate on the distribution of dropout rates. Some researchers assume that the dropout rates obey a specific distribution as a prior [10], [20], [27]. For example, Ba et al. [27] held that dropout rates obey the Bernoulli distribution and constructed a binary belief network over the DNN, which generates the dropout rates adaptively by minimizing an energy function. However, the additional binary belief network incurs more computational overhead as the model size increases. Moreover, Li et al. [20] sampled dropout rates from a prior multinomial distribution and proposed an evolutionary dropout via risk bound analysis for the optimization of Stochastic Gradient Descent (SGD) [28]. Aside from assuming a prior distribution, another category of dropout variants attempts to estimate the distribution of dropout rates via an optimization framework [17], [19], [22], [24], [29], [30]. Based on the Bayesian optimization framework, a variety of dropout methods have been proposed [19], [24], [30], [31]. Based on deep Bayesian learning, Gal et al. [29] proposed “concrete dropout” to deliver improved performance and better-calibrated uncertainties. Through variational Bayesian inference, Kingma et al. [30] explored an extended version of Gaussian dropout called “variational dropout” with local re-parameterization. Recently, an increasing number of studies have tried to estimate the distribution of dropout rates via risk bound optimization [17], [22]. Through Rademacher Complexity, Zhai et al. [22] proposed an adaptive dropout regularizer on the objective function.
It is worth noting that our research is fundamentally different from their work: they take the Rademacher Complexity term as a regularizer on the objective function of DNNs, whereas we utilize Rademacher Complexity to estimate and optimize the generalization gap.
In this paper, we propose a novel method that achieves adaptive adjustment of dropout rates with low computational complexity. In fact, estimating the distribution of dropout rates directly is as challenging as a grid search over a large number of hyperparameters [22]. As an alternative, we estimate the distribution of dropout rates by optimizing the generalization gap. Inspired by the estimation of Gao et al. [21], we first prove that the generalization gap of DNNs trained with dropout is bounded by a constraint function related to the dropout rates. Subsequently, to optimize the generalization gap, we minimize this constraint function through a theoretical derivation. As a result, we obtain a closed-form solution to the optimization problem, which represents a distribution estimation of the dropout rates. This solution provides an efficient and concise way to calculate dropout rates from the batch inputs of the dropout layer. Finally, we propose an adaptive algorithm called Rademacher Dropout (RadDropout) based on the closed-form solution. The algorithm generates the dropout rates adaptively during feed-forward propagation and requires only lightweight computation. To further justify our method, we conduct experiments on five benchmark datasets: MNIST, SVHN, NORB, Cifar-100, and TinyImagenet. The experiments include comparisons with several traditional and state-of-the-art dropout approaches. The results illustrate that RadDropout improves both convergence rate and prediction accuracy.
The main contributions of this paper are as follows:
- We first prove that the generalization gap of DNNs trained with dropout is bounded by a constraint function related to the dropout rates.
- We optimize the generalization gap by a theoretical derivation and obtain a closed-form solution as a distribution estimation of the dropout rates.
- We propose RadDropout, a novel dropout algorithm based on our solution, which achieves adaptive adjustment of dropout rates with lightweight complexity.
The remainder of this paper is organized as follows. In Section 2, we present some preliminaries. In Section 3, we detail the theoretical derivations and the proposed adaptive algorithm, RadDropout. We present the experimental results on five datasets in Section 4 and draw conclusions in Section 5.
Expression of DNN
Here, we give the expression of the structure of a fully connected network:
- The deep network is composed of L hidden layers h1, h2, …, hL, and layer hi has Ni neurons (1 ≤ i ≤ L).
- The weights between consecutive layers are Wi (1 ≤ i ≤ L).
- The dropout operation can be considered as an element-wise product of the input vector and a bitmask vector, which is generated according to the dropout rates. Following Warde-Farley et al.
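A minimal sketch of this element-wise bitmask formulation, extended to a per-neuron vector of dropout rates (the setting adaptive methods such as RadDropout target), is given below; the function name and the rate values are hypothetical illustrations, not the paper's closed-form rule:

```python
import numpy as np

rng = np.random.default_rng(1)

def dropout_with_rates(x, rates):
    """Dropout as an element-wise product of the input vector and a
    bitmask vector. `rates` is a per-neuron vector of drop
    probabilities; a scalar rate, broadcast to all neurons, recovers
    the traditional layer-wide setting."""
    rates = np.broadcast_to(rates, x.shape)
    bitmask = (rng.random(x.shape) >= rates).astype(x.dtype)
    return x * bitmask

x = np.array([0.5, -1.2, 3.0, 0.7])
# A neuron with rate 1.0 is always dropped; one with rate 0.0 is always kept.
print(dropout_with_rates(x, np.array([0.0, 1.0, 0.0, 1.0])))
```

Allowing `rates` to vary per neuron is what lets the drop probability reflect each neuron's contribution, rather than treating all neurons in the layer equally.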
Rademacher dropout: optimization of generalization gap
Before further analyzing the generalization gap, we state an assumption. For convenience of derivation, we suppose that only one hidden layer hs has a dropout operation. Since the expression of the DNN is recursive, the case of a single dropout layer can easily be generalized to multi-layer dropout.
Experiment
In this section, we first perform experiments to validate the effectiveness of the proposed algorithm, RadDropout, on three benchmark datasets in the image recognition area: MNIST, NORB, and SVHN. Then, we further verify the proposed RadDropout on two comparatively large and complex datasets: Cifar-100 and TinyImagenet.
Conclusions
In this paper, we propose a novel dropout method to achieve the adaptive adjustment of dropout rates. Based on Dropout Rademacher Complexity, we first prove that the generalization gap is bounded by a constraint function related to the dropout rates. By a theoretical derivation, we minimize the constraint function and derive a closed-form solution as the distribution estimation of dropout rates. As a result, we propose an adaptive dropout algorithm called Rademacher Dropout based on the closed-form solution.
Conflict of Interest
The authors declare that there are no conflicts of interest regarding the publication of this article.
Acknowledgement
This work was supported by the National Natural Science Foundation of China (Grant Nos. 91648204 and 11701566).
References (48)
- et al., Deep sequential fusion LSTM network for image description, Neurocomputing, 2018.
- et al., Optimization of deep convolutional neural network for large scale image retrieval, Neurocomputing, 2018.
- et al., Concise deep reinforcement learning obstacle avoidance for underactuated unmanned marine vessels, Neurocomputing, 2018.
- et al., When ensemble learning meets deep learning: a new deep support vector machine for classification, Knowl.-Based Syst., 2016.
- et al., Towards perfect text classification with Wikipedia-based semantic naïve Bayes learning, Neurocomputing, 2018.
- et al., On large-batch training for deep learning: generalization gap and sharp minima, International Conference on Learning Representations (ICLR), 2017.
- et al., Train longer, generalize better: closing the generalization gap in large batch training of neural networks, Advances in Neural Information Processing Systems, 2017.
- K. Kawaguchi, L.P. Kaelbling, Y. Bengio, Generalization in deep learning, arXiv:1710.05468.
- et al., Batch normalization: accelerating deep network training by reducing internal covariate shift, International Conference on Machine Learning, 2015.
- G.E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, R.R. Salakhutdinov, Improving neural networks by preventing...
- Dropout: a simple way to prevent neural networks from overfitting, J. Mach. Learn. Res.
- Ensemble learning using decorrelated neural networks, Connect. Sci.
- Dropout improves recurrent neural networks for handwriting recognition, Proceedings of the 14th International Conference on Frontiers in Handwriting Recognition (ICFHR), 2014.
- Altitude training: strong bounds for single-layer dropout, Advances in Neural Information Processing Systems.
- Distance metric learning using dropout: a structured regularization approach, Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
- An empirical analysis of dropout in piecewise linear networks, International Conference on Learning Representations (ICLR).
- Dropout training, data-dependent regularization, and generalization bounds, Proceedings of the International Conference on Machine Learning.
- Improved dropout for shallow and deep learning, Advances in Neural Information Processing Systems.
- Dropout Rademacher complexity of deep neural networks, Sci. China Inf. Sci.
- Dropout training as adaptive regularization, Advances in Neural Information Processing Systems.
- Dropout as a Bayesian approximation: representing model uncertainty in deep learning, Proceedings of the International Conference on Machine Learning.
Haotian Wang was born in Anhui, China, in 1995. He received the B.E. degree in computer science and technology from National University of Defense Technology, Changsha, China, in 2017. He is currently pursuing the M.E. degree in computer science and technology with the National University of Defense Technology, Changsha, China. His research interests include machine learning and computer vision.
Wenjing Yang was born in Changsha, China. She received the Ph.D. degree in multi-scale modelling from the University of Manchester. She is currently an associate research fellow in the State Key Laboratory of High Performance Computing, National University of Defense Technology. Her research interests include machine learning, robotics software, and high-performance computing.
Zhenyu Zhao received the B.S. degree in mathematics from the University of Science and Technology of China in 2009, and the Ph.D. degree in applied mathematics from National University of Defense Technology in 2016. He is currently a lecturer at the College of Liberal Arts and Sciences, National University of Defense Technology. His research interests include computer vision, pattern recognition, and machine learning.
Tingjin Luo received the Ph.D. degree from the College of Science, National University of Defense Technology, Changsha, China, in 2018. He received the B.S. and master’s degrees from the College of Information System and Management, National University of Defense Technology, Changsha, China, in 2011 and 2013, respectively. He was a visiting Ph.D. student with the University of Michigan, Ann Arbor, MI, USA, from 2015 to 2017. He is currently an Assistant Professor with the College of Science, National University of Defense Technology. He has authored several papers in journals and conferences, such as IEEE TKDE, IEEE TCYB, IEEE TIP, Scientific Reports, and KDD 2017. His current research interests include machine learning, multimedia analysis, optimization, and computer vision.
Ji Wang is a professor in the State Key Laboratory of High Performance Computing, School of Computer, National University of Defense Technology. He received his PhD in Computer Science from National University of Defense Technology. His research interests include programming methodology, formal methods and software engineering for smart systems.
Yuhua Tang received the BS and MS degrees from the Department of Computer Science, National University of Defense Technology, China, in 1983 and 1986, respectively. She is currently a professor in the State Key Laboratory of High Performance Computing, National University of Defense Technology. Her research interests include supercomputer architecture and core router’s design.