Pre-training the deep generative models with adaptive hyperparameter optimization
Introduction
Many machine learning algorithms are highly sensitive to their hyperparameter settings. Traditionally, tuning hyperparameters has been the job of human experts, which is laborious and difficult. For Deep Learning algorithms the challenge is even greater, because there are so many hyperparameters, governing not only the structure of the deep architectures but also their learning strategies and procedures.
A natural approach is grid search, which suffers from the exponential number of hyperparameter combinations. Later empirical studies showed that random search [1] is more efficient than grid search. In recent years, the framework of Bayesian optimization (BO) [2], [3], [4], [5], [6] has shown success in finding optimal hyperparameter configurations within as small a time budget as possible. These optimizers build surrogate models to discover the intrinsic relationship between the hyperparameters and a holdout loss. The most popular methods based on this framework are sequential model-based algorithm configuration (SMAC) [2], the tree of Parzen estimators (TPE) [5] and Spearmint [4]. They have been further extended to meet the requirements of hyperparameter optimization (HO) for deep models [7], [8], [9], in which the number of hyperparameters is much larger and the training time much longer than in traditional machine learning algorithms.
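To make the contrast with grid search concrete, random search simply samples each hyperparameter from its own distribution and keeps the best configuration found. The sketch below is a minimal, self-contained illustration (the toy `holdout_loss` stands in for a full training run and is an assumption, not the paper's objective):

```python
import math
import random

random.seed(0)

# Toy "holdout loss" standing in for a full training run; in practice
# each evaluation would train the model with the sampled hyperparameters.
def holdout_loss(learning_rate, weight_cost):
    return (math.log10(learning_rate) + 2.0) ** 2 + (weight_cost - 0.01) ** 2

def random_search(n_trials=50):
    """Sample hyperparameters independently at random; keep the best."""
    best_cfg, best_loss = None, float("inf")
    for _ in range(n_trials):
        cfg = {
            "learning_rate": 10 ** random.uniform(-4, 0),  # log-uniform scale
            "weight_cost": random.uniform(0.0, 0.1),
        }
        loss = holdout_loss(**cfg)
        if loss < best_loss:
            best_cfg, best_loss = cfg, loss
    return best_cfg, best_loss

best_cfg, best_loss = random_search()
```

Unlike grid search, the number of trials here is a fixed budget rather than a product over per-dimension grids, which is why random search scales better when only a few hyperparameters matter.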
Despite their achievements on deep models, these methods still suffer from high computational expense. The reason is that BO methods treat a deep model as one black-box function, which has to be executed fully or partially many times before the hyperparameters can be assessed. Another weakness of the traditional BO methods is that they cannot be applied directly to unsupervised learning tasks, due to the absence of validation errors.
This paper focuses on a specific class of deep models, the deep generative models (DGMs) [10], [11], [12], and provides a new adaptive hyperparameter optimization procedure for the pre-training phase that avoids combining all the layers into one black-box function. Following this procedure, our method not only speeds up HO in supervised learning tasks, but can also be applied directly to unsupervised learning tasks, where the traditional BO methods are difficult to implement.
The classic architectures of the DGMs include the deep belief network (DBN) [11] and the deep Boltzmann machine (DBM) [12], [13]. They can successfully learn distributed representations [10], [14] from large sets of unlabeled data via the greedy unsupervised layer-by-layer learning strategy [15], [16]. By taking advantage of this strategy, inner-loop information can be used when optimizing the hyperparameters. Specifically, in each epoch of pre-training a DGM, we first fix the hyperparameters and learn the weights by the traditional procedure, then fix the weights and infer the best candidate hyperparameters for the next epoch using a Gaussian process (GP) [17]. The hyperparameters we can optimize simultaneously include the momentum, the learning rate, the weight cost, the sparsity cost, etc. In contrast to the traditional BO methods, which mainly target supervised models, pre-training follows an unsupervised procedure, where no holdout error is available. An alternative target is the probability of the observations. However, evaluating it in DGMs requires computing the so-called partition function, which is intractable [18]. Annealed importance sampling (AIS) [19], [20] can estimate the partition function approximately, but the algorithm itself is computationally intensive, because it usually needs hundreds of iterations to approach the partition function gradually.
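The alternation described above (fix hyperparameters, learn weights for one epoch; fix weights, let a GP propose the next hyperparameters) can be sketched as follows. This is a hedged illustration, not the paper's implementation: the RBF kernel, the UCB acquisition over random candidates, and the names `propose_next` and `train_one_epoch` are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def rbf_kernel(A, B, length_scale=0.3):
    """Squared-exponential kernel between two sets of points."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / length_scale ** 2)

def gp_posterior(X, y, Xq, noise=1e-4):
    """GP posterior mean and variance at query points Xq."""
    K = rbf_kernel(X, X) + noise * np.eye(len(X))
    Ks = rbf_kernel(X, Xq)
    Kss = rbf_kernel(Xq, Xq)
    mu = Ks.T @ np.linalg.solve(K, y)
    var = np.diag(Kss - Ks.T @ np.linalg.solve(K, Ks))
    return mu, np.maximum(var, 1e-12)

def propose_next(history_x, history_score, n_candidates=200, kappa=2.0):
    """UCB over random candidates: prefer hyperparameters with high
    predicted score plus an exploration bonus from posterior variance."""
    Xq = rng.random((n_candidates, history_x.shape[1]))
    mu, var = gp_posterior(history_x, history_score, Xq)
    return Xq[np.argmax(mu + kappa * np.sqrt(var))]

# Outer-loop sketch: one epoch of weight learning per hyperparameter proposal.
def train_one_epoch(model, hyperparams):
    ...  # placeholder: CD/PCD weight updates with hyperparams held fixed

history_x = rng.random((3, 2))  # e.g. (learning rate, momentum), scaled to [0, 1]
history_score = rng.random(3)   # per-epoch score observed after each epoch
next_hp = propose_next(history_x, history_score)
```

The key point is that each GP fit is over per-epoch observations gathered inside one training run, rather than over many complete black-box training runs as in standard BO.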
This paper proposes a new holdout loss, the free energy gap (FEG), which acts as the target when the GP infers the best candidate hyperparameters. We combine two kinds of FEGs, denoted FEGf and FEGo. FEGf is an indicator of model fitting, while FEGo monitors over-fitting. More specifically, we try to find better hyperparameters that potentially increase FEGf as much as possible and decrease FEGo as much as possible. In the real training process, however, FEGf and FEGo vary at different rates, which makes it difficult to combine the two FEGs directly into one useful loss. To solve this problem, we normalize each signal's variation by an exponentially decaying moving average of its second moment, so that the velocities of FEGf and FEGo are taken into account. The definitions of FEGf and FEGo are given in detail in Section 4.
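The paper's exact combination rule is given in Section 4 (not reproduced on this page). As a hedged sketch under stated assumptions, a second-moment normalization of the two signals can be written in the style of RMSProp-like running averages; the class name, the decay constant, and the sign convention (reward rising FEGf, penalize rising FEGo) are all illustrative choices, not the authors' definitions:

```python
import math

class FegCombiner:
    """Combine the two free-energy-gap signals by normalizing each one's
    per-epoch change with an exponentially decaying average of its squared
    change (second moment), so both contribute on a comparable scale."""

    def __init__(self, rho=0.9, eps=1e-8):
        self.rho, self.eps = rho, eps
        self.v_fit = 0.0   # running second moment of FEGf changes
        self.v_over = 0.0  # running second moment of FEGo changes
        self.prev = None

    def update(self, feg_fit, feg_over):
        if self.prev is None:           # first epoch: nothing to compare yet
            self.prev = (feg_fit, feg_over)
            return 0.0
        d_fit = feg_fit - self.prev[0]
        d_over = feg_over - self.prev[1]
        self.prev = (feg_fit, feg_over)
        self.v_fit = self.rho * self.v_fit + (1 - self.rho) * d_fit ** 2
        self.v_over = self.rho * self.v_over + (1 - self.rho) * d_over ** 2
        # Normalized velocities: reward rising FEGf, penalize rising FEGo.
        return (d_fit / math.sqrt(self.v_fit + self.eps)
                - d_over / math.sqrt(self.v_over + self.eps))
```

Dividing each change by the root of its own running second moment rescales both signals to roughly unit velocity, which is what allows two quantities that "vary at different rates" to be summed into one loss.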
We perform experiments on the MNIST digits dataset to demonstrate that our new method is much more efficient than the traditional BO methods. On the unsupervised learning task of text clustering, the empirical results show that DGMs with adaptive hyperparameters can surpass the state of the art.
The remainder of this paper is organized as follows. Section 2 reviews related work on the DGMs and hyperparameter optimization methods. We briefly review the unsupervised pre-training procedures of the DGMs in Section 3. Section 4 presents the new method of adaptive optimization for the hyperparameters. Section 5 provides experimental results on the MNIST, 20 Newsgroups and Reuters-21578 datasets. Finally, we give the conclusion in Section 6.
Section snippets
Deep generative models
One important reason for the great success of Deep Learning is that it can extract meaningful representations from large sets of available data by unsupervised learning through many layers [15], [21], [22]. Two classic types of deep models do this job successfully on batch data: convolutional neural networks (CNNs) [23], [24] and deep generative models (DGMs) [10]. The CNNs can extract different levels of features from the data (typically images)
Unsupervised learning of deep generative models
The deep generative model (DGM) [10] is a hierarchical probabilistic model that originates from the Boltzmann machine (BM) [43]. The BM (Fig. 1, left) is an energy-based model and can be trained quite well by persistent Markov chains on a fairly small data set. However, for large data sets, training the stacked BM is far from efficient [43]. After the within-layer connections are removed, the new structure is called the restricted Boltzmann machine (RBM) (Fig. 1, right) [43],
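For reference, the standard RBM quantities that the free-energy-based loss later builds on can be written as follows (this is the textbook formulation, with visible units $v$, hidden units $h$, biases $a, b$ and weights $W$; it is supplied here for context, not quoted from the truncated snippet):

```latex
% Joint energy of an RBM configuration
E(v, h) = -a^\top v - b^\top h - v^\top W h
% Free energy: hidden units summed out analytically
F(v) = -\log \sum_h e^{-E(v, h)}
     = -a^\top v - \sum_j \log\!\left(1 + e^{\,b_j + v^\top W_{:,j}}\right)
```

Because the hidden units can be summed out in closed form, $F(v)$ is cheap to evaluate per example, unlike the log-likelihood, which still requires the intractable partition function.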
Approach
When training a DGM, human experts can usually judge whether a hyperparameter setting is appropriate from several kinds of indicators, including the sparsity of the hidden units, the free energy variation, the visualization of the weights, the error rate of data reconstruction, etc. This paper adopts one of these indicators, the free energy variation, for our hyperparameter optimization. More specifically, the procedure is: in each epoch, we first fix the hyperparameters to
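The paper's precise FEGf and FEGo definitions live in Section 4, which this page truncates. As an illustrative stand-in, the classic free-energy-based overfitting indicator compares the average free energy of a training batch with that of a held-out batch; the helper names below are assumptions for the sketch:

```python
import numpy as np

def free_energy(v, W, a, b):
    """RBM free energy F(v) = -v.a - sum_j softplus(b_j + (vW)_j),
    computed row-wise for a batch of binary visible vectors."""
    pre = v @ W + b
    return -(v @ a) - np.logaddexp(0.0, pre).sum(axis=1)

def free_energy_gap(v_train, v_holdout, W, a, b):
    """Difference of mean free energies between a training batch and a
    held-out batch; a gap that keeps growing signals over-fitting."""
    return (free_energy(v_train, W, a, b).mean()
            - free_energy(v_holdout, W, a, b).mean())

# Tiny synthetic example: 6 visible units, 4 hidden units.
rng = np.random.default_rng(1)
W = rng.normal(scale=0.01, size=(6, 4))
a, b = np.zeros(6), np.zeros(4)
v_tr = rng.integers(0, 2, size=(10, 6)).astype(float)
v_ho = rng.integers(0, 2, size=(10, 6)).astype(float)
gap = free_energy_gap(v_tr, v_ho, W, a, b)
```

Because the free energy is computable without the partition function, such a gap can be tracked after every epoch at negligible cost, which is what makes it usable as a per-epoch target for the GP.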
Experiments
In this section, we perform experiments to demonstrate that our method not only speeds up the hyperparameter optimization of the DGMs, but also obtains competitive performance on unsupervised learning tasks, where the traditional BO methods are difficult to deploy. To this end, we organize our experiments into two parts: Section 5.1 focuses on the efficiency of the hyperparameter optimization. We perform an empirical evaluation on the MNIST digits classification task to show
Conclusion
In this paper, we present a new hyperparameter optimization method based on the free energy for the pre-training of deep generative models, in which the hyperparameters are tuned automatically and adaptively. Since pre-training is unsupervised, the traditional Bayesian optimization methods cannot be applied directly. To solve this problem, we propose a new holdout loss, the free energy gap. We integrate the Gaussian process to find the best candidate for the next epoch with respect
Acknowledgment
This work was supported by National Basic Research Program of China (973 Program) under Grant 2013CB336500, Chinese National 863 Program of Demonstration of Digital Medical Service and Technology in Destined Region (Grant No. 2012-AA02A614) and National Youth Top-notch Talent Support Program.
References (68)
- et al., Hyperparameter learning in probabilistic prototype-based models, Neurocomputing (2010)
- et al., A 3d model recognition mechanism based on deep Boltzmann machines, Neurocomputing (2015)
- et al., Time series forecasting using a deep belief network with restricted Boltzmann machines, Neurocomputing (2014)
- et al., Semantic hashing, Int. J. Approx. Reason. (2009)
- et al., Deep multimodal distance metric learning using click constraints for image ranking, IEEE Trans. Cybern. (2016)
- et al., Recent developments on deep big vision, Neurocomputing (2016)
- et al., Random search for hyper-parameter optimization, J. Mach. Learn. Res. (2012)
- et al., Sequential model-based optimization for general algorithm configuration, Proceedings of the 5th International Conference on Learning and Intelligent Optimization, LION 5 (2011)
- et al., Time-bounded sequential parameter optimization, Proceedings of the 4th International Conference on Learning and Intelligent Optimization, LION 4 (2010)
- et al., Practical Bayesian optimization of machine learning algorithms, Proceedings of the 26th Annual Conference on Neural Information Processing Systems (2012)
- Algorithms for hyper-parameter optimization, Proceedings of the 25th Annual Conference on Neural Information Processing Systems
- Freeze-thaw Bayesian optimization, Eprint Arxiv
- Scalable Bayesian optimization using deep neural networks, Statistics
- Learning deep generative models, Ann. Rev. Stat. Appl.
- A fast learning algorithm for deep belief nets, Neural Comput.
- An efficient learning procedure for deep Boltzmann machines, Neural Comput.
- Learning structured output representation using deep conditional generative models, Proceedings of the Annual Conference on Neural Information Processing Systems
- An empirical evaluation of deep architectures on problems with many factors of variation, Proceedings of the Twenty-Fourth International Conference on Machine Learning
- Deep learning, Nature
- Why does unsupervised pre-training help deep learning?, J. Mach. Learn. Res.
- Gaussian Processes for Machine Learning
- Using fast weights to improve persistent contrastive divergence, Proceedings of the 26th Annual International Conference on Machine Learning
- Annealed importance sampling, Stat. Comput.
- Annealed importance sampling for structure learning in Bayesian networks, Proceedings of the 23rd International Joint Conference on Artificial Intelligence
- Coupled deep autoencoder for single image super-resolution, IEEE Trans. Cybern.
- Handwritten digit recognition with a back-propagation network, Proceedings of Advances in Neural Information Processing Systems
- Gradient-based learning applied to document recognition, Proceedings of the IEEE
- CNNpack: packing convolutional neural networks in the frequency domain, Proceedings of Annual Conference on Neural Information Processing Systems
- Greedy layer-wise training of deep networks, Proceedings of the Twentieth Annual Conference on Neural Information Processing Systems
- Exploring strategies for training deep neural networks, J. Mach. Learn. Res.
- Using deep belief nets to learn covariance kernels for Gaussian processes, Proceedings of the Twenty-First Annual Conference on Neural Information Processing Systems
Cited by (40)
- An intrusion detection algorithm based on joint symmetric uncertainty and hyperparameter optimized fusion neural network, 2024, Expert Systems with Applications
- Predicting 4D hardness property from 3D datasets for performance-tunable material extrusion additive manufacturing, 2024, Materials Today Communications
- Pix2Pix Hyperparameter Optimisation Prediction, 2023, Procedia Computer Science
- Global meta-analysis of evolution patterns for lake topics over centurial scale: A natural language understanding-based deep clustering approach with 130,000 studies, 2022, Journal of Hydrology
- Attack classification of an intrusion detection system using deep learning and hyperparameter optimization, 2021, Journal of Information Security and Applications
- Efficient hyperparameter optimization through model-based reinforcement learning, 2020, Neurocomputing
Chengwei Yao received the Master degree in Computer Science from Zhejiang University, China, in 2000. He is currently a candidate for a Ph.D. degree in College of Computer Science, at Zhejiang University. His research interests include Machine Learning and information retrieval.
Deng Cai is a Professor in the State Key Lab of CAD&CG, College of Computer Science at Zhejiang University, China. He received the Ph.D. degree in computer science from University of Illinois at Urbana Champaign in 2009. His research interests include machine learning, data mining and information retrieval.
Jiajun Bu received the B.S. and Ph.D. degrees in Computer Science from Zhejiang University, China, in 1995 and 2000, respectively. He is a professor in College of Computer Science and the deputy dean of School of Software Technology at Zhejiang University. His research interests include data mining, intelligent multimedia and etc. He is a member of the IEEE and the ACM.
Gencai Chen is a professor in the College of Computer Science, Zhejiang University. His research interests include DBMS, datamining, CSCW and information retrieval.