
Neurocomputing

Volume 247, 19 July 2017, Pages 144-155

Pre-training the deep generative models with adaptive hyperparameter optimization

https://doi.org/10.1016/j.neucom.2017.03.058

Abstract

The performance of many machine learning algorithms depends crucially on their hyperparameter settings, especially in Deep Learning. Manually tuning the hyperparameters is laborious and time consuming. To address this issue, Bayesian optimization (BO) methods and their extensions have been proposed to optimize the hyperparameters automatically. However, they still suffer from high computational expense when applied to deep generative models (DGMs) due to their black-box function optimization strategy. This paper provides a new hyperparameter optimization procedure for the pre-training phase of the DGMs, in which we avoid combining all the layers into one black-box function by taking advantage of the layer-by-layer learning strategy. Following this procedure, we are able to optimize multiple hyperparameters adaptively by using a Gaussian process. In contrast to the traditional BO methods, which mainly focus on supervised models, the pre-training procedure is unsupervised, so no validation error can be used. To alleviate this problem, this paper proposes a new holdout loss, the free energy gap, which takes into account both model fitting and over-fitting. The empirical evaluations demonstrate that our method not only speeds up the process of hyperparameter optimization, but also significantly improves the performance of DGMs in both supervised and unsupervised learning tasks.

Introduction

Many machine learning algorithms are very sensitive to their hyperparameter settings. Traditionally, tuning hyperparameters is the job of human experts, which is laborious and difficult. When it comes to Deep Learning algorithms, the challenge is even greater, because there are many more hyperparameters, governing not only the structure of the deep architectures but also their learning strategies and procedures.

A natural attempt is grid search, which suffers from the exponential number of hyperparameter combinations. Later empirical studies show that random search [1] is more efficient than grid search. In recent years, the framework of Bayesian optimization (BO) [2], [3], [4], [5], [6] has shown success in finding optimal hyperparameter configurations within as small a time budget as possible. These optimizers build surrogate models to capture the intrinsic relationship between the hyperparameters and a holdout loss. The most popular methods based on this framework are the sequential model-based algorithm configuration (SMAC) [2], the tree of Parzen estimators (TPE) [5] and Spearmint [4]. They have been further extended to meet the requirements of hyperparameter optimization (HO) for deep models [7], [8], [9], in which the number of hyperparameters is much larger and the training time is much longer than in traditional machine learning algorithms.
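
For concreteness, the sketch below shows the generic surrogate-based loop that these optimizers share, using scikit-learn's GaussianProcessRegressor and an expected-improvement acquisition over a single hyperparameter; the objective train_and_validate, the search range and the candidate pool are illustrative stand-ins rather than the actual SMAC, TPE or Spearmint implementations. The key point is that every assessment of a candidate setting requires another full (or partial) training run of the model.

```python
# Minimal sketch of a surrogate-based hyperparameter optimization loop
# (illustrative only; the objective and search range are hypothetical).
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

def train_and_validate(learning_rate):
    """Hypothetical black box: train the full model once, return a validation error."""
    return (np.log10(learning_rate) + 2.0) ** 2 + 0.01 * np.random.randn()

bounds = (1e-4, 1e-1)                            # search range for the learning rate
X = list(np.random.uniform(*bounds, 3))          # a few random initial evaluations
y = [train_and_validate(x) for x in X]

for _ in range(20):
    # fit the surrogate on all (hyperparameter, holdout loss) pairs seen so far
    gp = GaussianProcessRegressor(normalize_y=True).fit(
        np.log10(np.array(X)).reshape(-1, 1), y)
    cand = np.random.uniform(*bounds, 1000)      # random candidate pool
    mu, sigma = gp.predict(np.log10(cand).reshape(-1, 1), return_std=True)
    best = min(y)
    z = (best - mu) / (sigma + 1e-9)
    ei = (best - mu) * norm.cdf(z) + sigma * norm.pdf(z)   # expected improvement
    x_next = float(cand[np.argmax(ei)])
    X.append(x_next)
    y.append(train_and_validate(x_next))         # one more full training run per assessment

print("best learning rate found:", X[int(np.argmin(y))])
```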

Despite their achievements on deep models, these methods still suffer from high computational expense. The reason is that BO methods treat a deep model as one black-box function, which has to be executed, fully or partially, many times before the hyperparameters can be assessed. Another weakness of the traditional BO methods is that they cannot be applied directly to unsupervised learning tasks, due to the absence of validation errors.

This paper focuses on a specific class of deep models, the deep generative models (DGMs) [10], [11], [12], and provides a new adaptive hyperparameter optimization procedure for the pre-training phase, in which we avoid combining all the layers into one black-box function. Following this procedure, our method not only speeds up the process of HO in supervised learning tasks, but can also be applied directly to unsupervised learning tasks, where the traditional BO methods are difficult to deploy.

The classic architectures of the DGMs include the deep belief network (DBN) [11] and the deep Boltzmann machine (DBM) [12], [13]. They can successfully learn distributed representations [10], [14] from large sets of unlabeled data by the greedy unsupervised layer-by-layer learning strategy [15], [16]. By taking advantage of this strategy, the inner-loop information can be used when optimizing the hyperparameters. Specifically, when pre-training the DGMs, in each epoch we first fix the hyperparameters and learn the weights by the traditional procedure, then fix the weights and infer the best candidate hyperparameters for the next epoch by using a Gaussian process (GP) [17]. The hyperparameters we can optimize simultaneously include the momentum, the learning rate, the weight cost, the sparsity cost, etc. In contrast to the traditional BO methods, which mainly focus on supervised models, the pre-training follows an unsupervised procedure, where no holdout error can be adopted. An alternative choice is p(v), the probability of the observations. However, evaluating p(v) in DGMs requires computing the so-called partition function, which is intractable [18]. Annealed importance sampling (AIS) [19], [20] was proposed to estimate the partition function approximately, but the algorithm itself is computationally intensive, because it usually has to iterate hundreds of times or more to approach the partition function gradually.
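
The following is a toy, self-contained sketch of this per-epoch procedure: the "layer" is a small binary RBM trained with CD-1, only the learning rate is tuned, and the holdout loss is a simplified free-energy-based stand-in for the loss defined in Section 4. All names, constants and the acquisition rule are illustrative assumptions, not the authors' implementation.

```python
# Toy sketch of the adaptive per-epoch loop: fix hyperparameters and learn the
# weights, then fix the weights and let a GP propose the next setting.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

rng = np.random.default_rng(0)
V, H = 20, 10                                        # visible / hidden units
W, a, b = 0.01 * rng.standard_normal((V, H)), np.zeros(V), np.zeros(H)
data = (rng.random((500, V)) < 0.3).astype(float)    # synthetic binary data
train, holdout = data[:400], data[400:]

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def free_energy(v):                                  # standard RBM free energy
    return -v @ a - np.logaddexp(0, v @ W + b).sum(axis=1)

def cd1_epoch(v, lr):                                # one epoch of CD-1 weight updates
    global W, a, b
    h = sigmoid(v @ W + b)
    v_rec = sigmoid((h > rng.random(h.shape)).astype(float) @ W.T + a)
    h_rec = sigmoid(v_rec @ W + b)
    W += lr * (v.T @ h - v_rec.T @ h_rec) / len(v)
    a += lr * (v - v_rec).mean(axis=0)
    b += lr * (h - h_rec).mean(axis=0)

lrs, losses, lr = [], [], 0.05
for epoch in range(15):
    cd1_epoch(train, lr)                             # step 1: hyperparameter fixed, weights learned
    gap = free_energy(holdout).mean() - free_energy(train).mean()
    lrs.append(lr); losses.append(float(gap))        # step 2: weights fixed, setting scored
    gp = GaussianProcessRegressor(normalize_y=True).fit(
        np.array(lrs).reshape(-1, 1), losses)        # step 3: GP proposes the next epoch's value
    cand = rng.uniform(1e-3, 0.2, 200)
    mu, sd = gp.predict(cand.reshape(-1, 1), return_std=True)
    lr = float(cand[np.argmin(mu - sd)])             # optimistic (lower-confidence-bound) choice
```

In this scheme the GP is refitted on quantities that the ordinary pre-training pass already produces, so no additional full training runs are required; this is the source of the speed-up over the black-box loop sketched earlier.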

This paper proposes a new holdout loss, the free energy gap (FEG), which can act as the target when inferring the best candidate hyperparameters by GP. Here we combine two kinds of FEGs, denoted FEG_f and FEG_o. FEG_f is an indicator of model fitting, while FEG_o monitors over-fitting. More specifically, we try to find better hyperparameters which can potentially increase FEG_f as much as possible and decrease FEG_o as much as possible. However, in the real training process, FEG_f and FEG_o vary at different rates, which makes it difficult to combine the two FEGs directly into one useful loss. To solve this problem, we introduce a technique that evaluates an exponentially decaying moving average of the second moment of the variation, so that the velocities of FEG_f and FEG_o are taken into account. The definitions of FEG_f and FEG_o are given in detail in Section 4.
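
Since the exact definitions of FEG_f and FEG_o are deferred to Section 4, the snippet below only illustrates the mechanics referred to here: normalising the per-epoch change of each signal by an exponentially decaying moving average of its second moment, so that two quantities with very different scales of variation become comparable before they are combined. The synthetic traces and the final combination rule are assumptions for illustration, not the paper's formula.

```python
# Illustration of second-moment (RMS) normalisation of per-epoch changes,
# using synthetic FEG traces; the paper's actual combination is in Section 4.
import numpy as np

rng = np.random.default_rng(1)
feg_f_trace = np.cumsum(0.5 + rng.standard_normal(30))    # large, rising signal
feg_o_trace = np.cumsum(0.05 * rng.standard_normal(30))   # small-scale signal

def normalised_velocity(trace, decay=0.9, eps=1e-8):
    """Per-epoch change divided by the exponentially decaying RMS of past changes."""
    v2, out = 0.0, []
    for delta in np.diff(trace):
        v2 = decay * v2 + (1.0 - decay) * delta ** 2       # moving average of the 2nd moment
        out.append(delta / (np.sqrt(v2) + eps))            # change in units of its own RMS
    return np.array(out)

vel_f = normalised_velocity(feg_f_trace)    # how fast FEG_f is increasing
vel_o = normalised_velocity(feg_o_trace)    # how fast FEG_o is increasing
target = vel_f - vel_o                      # one simple way to reward "FEG_f up, FEG_o down"
```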

We perform experiments on the MNIST digits dataset to demonstrate that our new method is much more efficient than the traditional BO methods. On the unsupervised learning task of text clustering, the empirical results show that DGMs with adaptive hyperparameters can surpass the state of the art.

The remainder of this paper is organized as follows. Section 2 reviews related work on DGMs and hyperparameter optimization methods. Section 3 briefly reviews the unsupervised pre-training procedure of the DGMs. Section 4 presents the new method of adaptive optimization of the hyperparameters. Section 5 provides experimental results on the MNIST, 20 Newsgroups and Reuters-21578 datasets. Finally, Section 6 concludes the paper.

Section snippets

Deep generative models

One of the important reasons for the great success of Deep Learning is that it can extract meaningful representations from large sets of available data by unsupervised learning through many layers [15], [21], [22]. There are two classic types of deep models that can do this job successfully on batch data: convolutional neural networks (CNNs) [23], [24] and deep generative models (DGMs) [10]. The CNNs can extract different levels of features from the data (typically images)…

Unsupervised learning of deep generative models

The deep generative model (DGM) [10] is a hierarchical probabilistic model which originates from the Boltzmann machine (BM) [43]. The BM (Fig. 1, left) is a kind of energy-based model and can be trained quite well by persistent Markov chains on a fairly small data set. However, for large data sets, training the stacked BM is far from efficient [43]. After the within-layer connections are removed, the new structure is called the restricted Boltzmann machine (RBM) (Fig. 1, right) [43]…
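
For reference, the standard RBM formulation (textbook material, e.g. [43], rather than anything specific to this paper) is as follows: with binary visible units v, hidden units h, weights W, biases a and b, and the logistic sigmoid σ, the energy, the factorized conditionals that make block Gibbs sampling cheap once the within-layer connections are removed, and the free energy on which the free energy gap of Section 4 builds are

```latex
E(\mathbf{v},\mathbf{h}) = -\mathbf{a}^{\top}\mathbf{v} - \mathbf{b}^{\top}\mathbf{h} - \mathbf{v}^{\top} W \mathbf{h},
\qquad
p(\mathbf{v},\mathbf{h}) = \frac{1}{Z}\, e^{-E(\mathbf{v},\mathbf{h})},
\qquad
Z = \sum_{\mathbf{v},\mathbf{h}} e^{-E(\mathbf{v},\mathbf{h})},

p(h_j = 1 \mid \mathbf{v}) = \sigma\!\Big(b_j + \sum_i v_i W_{ij}\Big),
\qquad
p(v_i = 1 \mid \mathbf{h}) = \sigma\!\Big(a_i + \sum_j W_{ij} h_j\Big),

F(\mathbf{v}) = -\log \sum_{\mathbf{h}} e^{-E(\mathbf{v},\mathbf{h})}
             = -\mathbf{a}^{\top}\mathbf{v} - \sum_j \log\!\Big(1 + e^{\,b_j + \sum_i v_i W_{ij}}\Big).
```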

Approach

When training a DGM, human experts can usually tell whether a hyperparameter setting is proper by means of several indicators, which include the sparsity of the hidden units, the free energy variation, the visualization of the weights, the reconstruction error of the data, etc. This paper adopts one of these indicators, the free energy variation, for our hyperparameter optimization. More specifically, the procedure is: in each epoch, we first fix the hyperparameters to…
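
As a hypothetical illustration of how cheap these indicators are to obtain (the helper below and its names are ours, not the authors'), for an RBM parameterised by weights W and biases a, b they can all be computed from a single pass over a data batch:

```python
# Hypothetical diagnostics helper for an RBM with parameters (W, a, b);
# only the free energy variation is used by the optimization procedure.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def rbm_diagnostics(v, W, a, b):
    h = sigmoid(v @ W + b)                            # hidden activation probabilities
    v_rec = sigmoid(h @ W.T + a)                      # mean-field reconstruction
    free_energy = -v @ a - np.logaddexp(0, v @ W + b).sum(axis=1)
    return {
        "hidden_sparsity": float(h.mean()),                     # average hidden activity
        "reconstruction_error": float(((v - v_rec) ** 2).mean()),
        "mean_free_energy": float(free_energy.mean()),          # tracked across epochs
    }
```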

Experiments

In this section, we perform experiments to demonstrate that our method not only speeds up the hyperparameter optimization of the DGMs, but also obtains competitive performance on unsupervised learning tasks, where the traditional BO methods can hardly be deployed. To achieve this, we organize our experiments into two parts: Section 5.1 focuses on the efficiency of the hyperparameter optimization. We perform an empirical evaluation on the MNIST digits classification task to show…

Conclusion

In this paper, we present a new hyperparameter optimization method based on the free energy for the pre-training of deep generative models, in which the hyperparameters are tuned automatically and adaptively. Since the pre-training is unsupervised, the traditional Bayesian optimization methods cannot be applied directly. To solve this problem, we propose a new holdout loss, the free energy gap. We integrate the Gaussian process to find the best candidate for the next epoch with respect…

Acknowledgment

This work was supported by National Basic Research Program of China (973 Program) under Grant 2013CB336500, Chinese National 863 Program of Demonstration of Digital Medical Service and Technology in Destined Region (Grant No. 2012-AA02A614) and National Youth Top-notch Talent Support Program.

References (68)

  • J. Bergstra et al.

    Algorithms for hyper-parameter optimization

    Proceedings of the 25th Annual Conference on Neural Information Processing Systems

    (2011)
  • K. Swersky et al.

    Freeze-thaw Bayesian optimization

    Eprint Arxiv

    (2014)
  • T. Domhan, J.T. Springenberg, F. Hutter, Speeding up automatic hyperparameter optimization of deep neural networks by...
  • J. Snoek et al.

    Scalable Bayesian optimization using deep neural networks

    Statistics

    (2015)
  • R. Salakhutdinov

    Learning deep generative models

    Ann. Rev. Stat. Appl.

    (2015)
  • G.E. Hinton et al.

    A fast learning algorithm for deep belief nets

Neural Comput.

    (2006)
  • R. Salakhutdinov et al.

    An efficient learning procedure for deep Boltzmann machines

    Neural Comput.

    (2012)
  • K. Sohn et al.

    Learning structured output representation using deep conditional generative models

    Proceedings of the Annual Conference on Neural Information Processing Systems

    (2015)
  • H. Larochelle et al.

    An empirical evaluation of deep architectures on problems with many factors of variation

    Proceedings of the Twenty-Fourth International Conference on Machine Learning

    (2007)
  • Y. LeCun et al.

    Deep learning

    Nature

    (2015)
  • D. Erhan et al.

    Why does unsupervised pre-training help deep learning?

    J. Mach. Learn. Res.

    (2010)
  • C.E. Rasmussen et al.

    Gaussian Processes for Machine Learning

    (2006)
  • T. Tieleman et al.

    Using fast weights to improve persistent contrastive divergence

    Proceedings of the 26th Annual International Conference on Machine Learning

    (2009)
  • R.M. Neal

    Annealed importance sampling

    Stat. Comput.

    (2001)
  • T.M. Niinimäki et al.

    Annealed importance sampling for structure learning in Bayesian networks

    Proceedings of the 23rd International Joint Conference on Artificial Intelligence

    (2013)
  • G. Hinton, A practical guide to training restricted Boltzmann machines, http://www.cs.toronto.edu/hinton/ (2010)...
  • K. Zeng et al.

    Coupled deep autoencoder for single image super-resolution

    IEEE Trans. Cybern.

    (2015)
  • Y. LeCun et al.

    Handwritten digit recognition with a back-propagation network

    Proceedings of Advances in Neural Information Processing Systems

    (1989)
  • Y. LeCun et al.

    Gradient-based learning applied to document recognition

    Proceedings of the IEEE

    (1998)
  • M. Figurnov, A. Ibraimova, D.P. Vetrov, P. Kohli, PerforatedCNNs: Acceleration through elimination of redundant...
  • Y. Wang et al.

    CNNpack: packing convolutional neural networks in the frequency domain

    Proceedings of Annual Conference on Neural Information Processing Systems

    (2016)
  • Y. Bengio et al.

    Greedy layer-wise training of deep networks

    Proceedings of the Twentieth Annual Conference on Neural Information Processing Systems

    (2006)
  • H. Larochelle et al.

    Exploring strategies for training deep neural networks

    J. Mach. Learn. Res.

    (2009)
  • R. Salakhutdinov et al.

    Using deep belief nets to learn covariance kernels for Gaussian processes

    Proceedings of the Twenty-First Annual Conference on Neural Information Processing Systems

    (2007)

    Chengwei Yao received the Master degree in Computer Science from Zhejiang University, China, in 2000. He is currently a candidate for a Ph.D. degree in College of Computer Science, at Zhejiang University. His research interests include Machine Learning and information retrieval.

    Deng Cai is a Professor in the State Key Lab of CAD&CG, College of Computer Science at Zhejiang University, China. He received the Ph.D. degree in computer science from the University of Illinois at Urbana-Champaign in 2009. His research interests include machine learning, data mining and information retrieval.

    Jiajun Bu received the B.S. and Ph.D. degrees in Computer Science from Zhejiang University, China, in 1995 and 2000, respectively. He is a professor in the College of Computer Science and the deputy dean of the School of Software Technology at Zhejiang University. His research interests include data mining, intelligent multimedia, etc. He is a member of the IEEE and the ACM.

    Gencai Chen is a professor in the College of Computer Science, Zhejiang University. His research interests include DBMS, data mining, CSCW and information retrieval.
