
Optimizing restricted Boltzmann machine learning by injecting Gaussian noise to likelihood gradient approximation


Abstract

Restricted Boltzmann machines (RBMs) can be trained by applying stochastic gradient ascent to the log-likelihood objective, i.e., by maximum likelihood learning. However, this is a difficult task because the gradient of the marginalization (partition) function is intractable. Several methods approximate this intractable term with Gibbs Markov chains, including Contrastive Divergence, Persistent Contrastive Divergence, and Fast Contrastive Divergence. In this paper, we propose an optimization that injects Gaussian noise into the underlying Monte Carlo estimation. We introduce two novel learning algorithms: Noisy Persistent Contrastive Divergence (NPCD) and Fast Noisy Persistent Contrastive Divergence (FNPCD). We prove that, under a satisfiable condition, the NPCD and FNPCD algorithms improve the average convergence to the equilibrium state. Our empirical investigation of diverse CD-based approaches shows that the proposed methods frequently achieve higher classification performance than traditional approaches on standard image classification benchmarks such as the MNIST, basic, and rotation datasets.
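As a purely hypothetical sketch (the abstract does not specify where the Gaussian noise enters the estimator), the idea of injecting noise into the persistent Gibbs chain of PCD might look as follows; all names (W, a, b, sigma, noise_std) are illustrative assumptions, and the conditionals are those of the Gaussian-visible RBM derived in the appendix.

```python
# Hypothetical sketch (not the authors' code) of a Gaussian-noise-injected
# persistent Gibbs step in the spirit of NPCD. The injection point is assumed:
# zero-mean Gaussian noise is added to the sampled visible state of the chain.
import numpy as np

def npcd_negative_phase(v_chain, W, a, b, sigma, noise_std, rng):
    # Hidden step: P(h_j = 1 | v) = sigmoid(sum_i v_i W_ij / sigma_i^2 + b_j)
    p_h = 1.0 / (1.0 + np.exp(-((v_chain / sigma**2) @ W + b)))
    h = (rng.random(p_h.shape) < p_h).astype(float)
    # Visible step: v_i | h ~ N(a_i + sum_j W_ij h_j, sigma_i^2)
    v_new = rng.normal(loc=a + W @ h, scale=sigma)
    # Assumed noise injection into the Monte Carlo sample (zero-mean Gaussian).
    v_new = v_new + rng.normal(0.0, noise_std, size=v_new.shape)
    p_h_new = 1.0 / (1.0 + np.exp(-((v_new / sigma**2) @ W + b)))
    return v_new, p_h_new  # next chain state and negative-phase hidden probabilities
```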


Notes

  1. Available online at http://yann.lecun.com/exdb/mnist/

  2. Available online at http://www-labs.iro.umontreal.ca/~lisa/icml2007data/mnist_rotation.zip


Author information


Corresponding author

Correspondence to Dae-Ki Kang.


Appendix: Complete derivative of BG-RBM

Energy function:

$$E\left( \textbf{v},\textbf{h}\right)_{BG} = \sum\limits_{i = 1}^{n_{v}}\frac{(v_{i}-a_{i})^{2}}{2{\sigma_{i}^{2}}}-\sum\limits_{j = 1}^{n_{h}}b_{j}h_{j} -\sum\limits_{i = 1}^{n_{v}}\sum\limits_{j = 1}^{n_{h}}\frac{v_{i}}{{\sigma_{i}^{2}}}W_{ij}h_{j} $$
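For concreteness, a minimal NumPy sketch of evaluating this energy function; the array names v, h, a, b, W, and sigma mirror the symbols above and are illustrative only.

```python
# Illustrative sketch of the BG-RBM energy E(v, h):
#   sum_i (v_i - a_i)^2 / (2 sigma_i^2) - sum_j b_j h_j
#   - sum_ij (v_i / sigma_i^2) W_ij h_j
import numpy as np

def bg_rbm_energy(v, h, a, b, W, sigma):
    quad = np.sum((v - a) ** 2 / (2.0 * sigma ** 2))  # Gaussian visible term
    hid = np.dot(b, h)                                # hidden bias term
    inter = np.dot(v / sigma ** 2, W @ h)             # visible-hidden interaction
    return quad - hid - inter
```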

1.1 Definition of conditional probability P(h|v)

$$\begin{array}{@{}rcl@{}} P(\textbf{h}|\textbf{v}) &=& \frac{P(\textbf{v},\textbf{h})}{P(\textbf{v})} = \frac{\frac{1}{Z}e^{-E(\textbf{v},\textbf{h})}}{\frac{1}{Z}{\sum}_{\textbf{h}} e^{-E(\textbf{v},\textbf{h})}} = \frac{e^{-E(\textbf{v},\textbf{h})}}{{\sum}_{\textbf{h}}e^{-E(\textbf{v},\textbf{h})}} \\ & =& \frac{e^{-\left( {\sum}_{i = 1}^{n_{v}}\frac{(v_{i}-a_{i})^{2}}{2{\sigma_{i}^{2}}}-{\sum}_{j = 1}^{n_{h}}b_{j}h_{j} -{\sum}_{i = 1}^{n_{v}}{\sum}_{j = 1}^{n_{h}}\frac{v_{i}}{{\sigma_{i}^{2}}}W_{ij}h_{j}\right) }}{{\sum}_{h} e^{-\left( {\sum}_{i = 1}^{n_{v}}\frac{(v_{i}-a_{i})^{2}}{2{\sigma_{i}^{2}}}-{\sum}_{j = 1}^{n_{h}}b_{j}h_{j} -{\sum}_{i = 1}^{n_{v}}{\sum}_{j = 1}^{n_{h}}\frac{v_{i}}{{\sigma_{i}^{2}}}W_{ij}h_{j}\right) }} \\ & =& \frac{e^{-{\sum}_{i = 1}^{n_{v}} \left( \frac{(v_{i}-a_{i})^{2}}{2{\sigma_{i}^{2}}} - {\sum}_{j = 1}^{n_{h}}\frac{v_{i}}{{\sigma_{i}^{2}}}W_{ij}h_{j}\right) + {\sum}_{j = 1}^{n_{h}}b_{j}h_{j}}}{{\sum}_{h} e^{-{\sum}_{i = 1}^{n_{v}} \left( \frac{(v_{i}-a_{i})^{2}}{2{\sigma_{i}^{2}}} - {\sum}_{j = 1}^{n_{h}}\frac{v_{i}}{{\sigma_{i}^{2}}}W_{ij}h_{j}\right) + {\sum}_{j = 1}^{n_{h}}b_{j}h_{j}}} \end{array} $$

Rewriting the above equation in terms of a product-of-experts model, the term depending only on v cancels between numerator and denominator:

$$\begin{array}{@{}rcl@{}} P(\textbf{h}|\textbf{v}) & =& \frac{e^{-{\sum}_{i = 1}^{n_{v}}\frac{(v_{i}-a_{i})^{2}}{2{\sigma_{i}^{2}}}}{\prod}_{j} e^{\left( {\sum}_{i = 1}^{n_{v}}\frac{v_{i}}{{\sigma_{i}^{2}}}W_{ij}+b_{j}\right) h_{j}}}{e^{-{\sum}_{i = 1}^{n_{v}}\frac{(v_{i}-a_{i})^{2}}{2{\sigma_{i}^{2}}}}{\prod}_{j} {\sum}_{h_{j}} e^{\left( {\sum}_{i = 1}^{n_{v}}\frac{v_{i}}{{\sigma_{i}^{2}}}W_{ij}+b_{j}\right) h_{j}}} \\ & =& \prod\limits_{j} \frac{e^{\left( {\sum}_{i = 1}^{n_{v}}\frac{v_{i}}{{\sigma_{i}^{2}}}W_{ij}+b_{j} \right) h_{j}}}{{\sum}_{h_{j}} e^{\left( {\sum}_{i = 1}^{n_{v}}\frac{v_{i}}{{\sigma_{i}^{2}}}W_{ij}+b_{j} \right) h_{j}} } = \prod\limits_{j} P(h_{j}|\textbf{v}) \end{array} $$

For binary \(h_{j} \in \{0, 1\}\), \(P(h_{j} = 1|\textbf{v})\) is:

$$P(h_{j} = 1|\textbf{v}) = \frac{e^{\left( {\sum}_{i = 1}^{n_{v}}\frac{v_{i}}{{\sigma_{i}^{2}}}W_{ij}+b_{j} \right) }}{e^{\left( {\sum}_{i = 1}^{n_{v}}\frac{v_{i}}{{\sigma_{i}^{2}}}W_{ij}+b_{j} \right) } + e^{0}} = \text{sig}\left( \sum\limits_{i = 1}^{n_{v}}\frac{v_{i}}{{\sigma_{i}^{2}}}W_{ij}+b_{j}\right) $$
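A minimal NumPy sketch of this conditional (illustrative, with the same symbol names as above); it computes P(h_j = 1 | v) for all hidden units at once and draws a binary sample.

```python
# Illustrative sketch: hidden conditional of the BG-RBM.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hidden_probs(v, b, W, sigma):
    # P(h_j = 1 | v) = sig( sum_i (v_i / sigma_i^2) W_ij + b_j )
    return sigmoid((v / sigma ** 2) @ W + b)

def sample_hidden(v, b, W, sigma, rng):
    p = hidden_probs(v, b, W, sigma)
    return (rng.random(p.shape) < p).astype(float)
```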

1.2 Definition of conditional probability P(v|h)

$$\begin{array}{@{}rcl@{}} P(\textbf{v}|\textbf{h}) & =& \frac{P(\textbf{v},\textbf{h})}{P(\textbf{h})} = \frac{\frac{1}{Z}e^{-E(\textbf{v},\textbf{h})}}{\frac{1}{Z}{\int}_{\textbf{v}} e^{-E(\textbf{v},\textbf{h})}dv} = \frac{e^{-E(\textbf{v},\textbf{h})}}{{\int}_{\textbf{v}}e^{-E(\textbf{v},\textbf{h})}dv} \\ & =& \frac{e^{-{\sum}_{i = 1}^{n_{v}} \left( \frac{(v_{i}-a_{i})^{2}}{2{\sigma_{i}^{2}}} - {\sum}_{j = 1}^{n_{h}}\frac{v_{i}}{{\sigma_{i}^{2}}}W_{ij}h_{j}\right) + {\sum}_{j = 1}^{n_{h}}b_{j}h_{j}}}{{\int}_{\textbf{v}} e^{-{\sum}_{i = 1}^{n_{v}} \left( \frac{(v_{i}-a_{i})^{2}}{2{\sigma_{i}^{2}}} - {\sum}_{j = 1}^{n_{h}}\frac{v_{i}}{{\sigma_{i}^{2}}}W_{ij}h_{j}\right) + {\sum}_{j = 1}^{n_{h}}b_{j}h_{j}}dv} \end{array} $$

Rewriting the above equation in terms of a product-of-experts model:

$$P(\textbf{v}|\textbf{h}) = \frac{{\prod}_{i} e^{-\left( \frac{(v_{i}-a_{i})^{2}}{2{\sigma_{i}^{2}}} - {\sum}_{j = 1}^{n_{h}}\frac{v_{i}}{{\sigma_{i}^{2}}}W_{ij}h_{j}\right) + {\sum}_{j = 1}^{n_{h}}b_{j}h_{j}}}{{\prod}_{i} {\int}_{\textbf{v}} e^{- \left( \frac{(v_{i}-a_{i})^{2}}{2{\sigma_{i}^{2}}} - {\sum}_{j = 1}^{n_{h}}\frac{v_{i}}{{\sigma_{i}^{2}}}W_{ij}h_{j}\right) + {\sum}_{j = 1}^{n_{h}}b_{j}h_{j}}dv} $$

Simplifying the denominator:

$$\begin{array}{@{}rcl@{}} P(\textbf{v}|\textbf{h}) & = & \frac{{\prod}_{i} e^{-\left( \frac{(v_{i}-a_{i})^{2}}{2{\sigma_{i}^{2}}} - {\sum}_{j = 1}^{n_{h}}\frac{v_{i}}{{\sigma_{i}^{2}}}W_{ij}h_{j}\right) + {\sum}_{j = 1}^{n_{h}}b_{j}h_{j}}}{{\prod}_{i} {\int}_{\textbf{v}} e^{-\left( \frac{1}{2{\sigma_{i}^{2}}}({v_{i}^{2}} - 2v_{i}a_{i} + {a_{i}^{2}})-{\sum}_{j = 1}^{n_{h}}\frac{v_{i}}{{\sigma_{i}^{2}}} W_{ij}h_{j} \right) + {\sum}_{j = 1}^{n_{h}}b_{j}h_{j} } dv } \\ & = & \frac{{\prod}_{i} e^{-\left( \frac{(v_{i}-a_{i})^{2}}{2{\sigma_{i}^{2}}} - {\sum}_{j = 1}^{n_{h}}\frac{v_{i}}{{\sigma_{i}^{2}}}W_{ij}h_{j}\right) + {\sum}_{j = 1}^{n_{h}}b_{j}h_{j}}}{{\prod}_{i} e^{\left( \frac{-{a_{i}^{2}}}{2{\sigma_{i}^{2}}}\right)+{\sum}_{j = 1}^{n_{h}}b_{j}h_{j}} {\int}_{\textbf{v}}e^{\frac{1}{2{\sigma_{i}^{2}}}\left( -{v_{i}^{2}} + 2v_{i}a_{i} \right)+{\sum}_{j = 1}^{n_{h}}\frac{v_{i}}{{\sigma_{i}^{2}}}W_{ij}h_{j}} dv} \\ & = & \frac{{\prod}_{i} e^{-\left( \frac{(v_{i}-a_{i})^{2}}{2{\sigma_{i}^{2}}} - {\sum}_{j = 1}^{n_{h}}\frac{v_{i}}{{\sigma_{i}^{2}}}W_{ij}h_{j}\right) + {\sum}_{j = 1}^{n_{h}}b_{j}h_{j}}}{{\prod}_{i} e^{\left( \frac{-{a_{i}^{2}}}{2{\sigma_{i}^{2}}}\right)+{\sum}_{j = 1}^{n_{h}}b_{j}h_{j}} {\int}_{\textbf{v}}e^{-\frac{{v_{i}^{2}}}{2{\sigma_{i}^{2}}}} e^{v_{i} \left( \frac{a_{i}}{{\sigma_{i}^{2}}} + {\sum}_{j = 1}^{n_{h}}\frac{W_{ij}}{{\sigma_{i}^{2}}} h_{j} \right) } dv} \end{array} $$

Integrating the denominator with respect to \(v_{i}\) (a standard Gaussian integral) gives:

$$\begin{array}{@{}rcl@{}} & =& \frac{{\prod}_{i} e^{-\left( \frac{(v_{i}-a_{i})^{2}}{2{\sigma_{i}^{2}}} - {\sum}_{j = 1}^{n_{h}}\frac{v_{i}}{{\sigma_{i}^{2}}}W_{ij}h_{j}\right) + {\sum}_{j = 1}^{n_{h}}b_{j}h_{j}}}{{\prod}_{i} e^{\left( \frac{-{a_{i}^{2}}}{2{\sigma_{i}^{2}}}\right)+{\sum}_{j = 1}^{n_{h}}b_{j}h_{j}} e^{\frac{{\sigma_{i}^{2}} \left( \frac{a_{i}}{{\sigma_{i}^{2}}} + {\sum}_{j = 1}^{n_{h}}\frac{W_{ij}}{{\sigma_{i}^{2}}}h_{j}\right)^{2} }{2}} \left( \sqrt{2{\sigma_{i}^{2}}\pi}\right)} \\ & =& \frac{{\prod}_{i} e^{-\left( \frac{(v_{i}-a_{i})^{2}}{2{\sigma_{i}^{2}}} - {\sum}_{j = 1}^{n_{h}}\frac{v_{i}}{{\sigma_{i}^{2}}}W_{ij}h_{j}\right) + {\sum}_{j = 1}^{n_{h}}b_{j}h_{j}}}{{\prod}_{i} \left( \sigma_{i}\sqrt{2\pi}\right) e^{\frac{1}{2{\sigma_{i}^{2}}} \left( {\sum}_{j = 1}^{n_{h}}W_{ij}h_{j} \right)^{2}+{\sum}_{j = 1}^{n_{h}}b_{j}h_{j}+\frac{a_{i}}{{\sigma_{i}^{2}}}{\sum}_{j = 1}^{n_{h}}W_{ij}h_{j}} } \end{array} $$

Simplifying the above equation:

$$\begin{array}{@{}rcl@{}} & =& \prod\limits_{i} \frac{1}{\sigma_{i}\sqrt{2\pi}} e^{-\left( \frac{(v_{i}-a_{i})^{2}}{2{\sigma_{i}^{2}}} - {\sum}_{j = 1}^{n_{h}}\frac{v_{i}}{{\sigma_{i}^{2}}}W_{ij}h_{j}\right) + {\sum}_{j = 1}^{n_{h}}b_{j}h_{j}-\left( \frac{1}{2{\sigma_{i}^{2}}} \left( {\sum}_{j = 1}^{n_{h}}W_{ij}h_{j} \right)^{2}+{\sum}_{j = 1}^{n_{h}}b_{j}h_{j}+\frac{a_{i}}{{\sigma_{i}^{2}}}{\sum}_{j = 1}^{n_{h}}W_{ij}h_{j}\right) } \\ & =& \prod\limits_{i} \frac{1}{\sigma_{i}\sqrt{2\pi}}e^{-\frac{1}{2{\sigma_{i}^{2}}} \left( v_{i} - \left( a_{i}+{\sum}_{j = 1}^{n_{h}}W_{ij}h_{j}\right) \right)^{2}} \end{array} $$

The above equation is the probability density function of a Gaussian distribution over each \(v_{i}\) with mean \(a_{i}+{\sum }_{j = 1}^{n_{h}}W_{ij}h_{j}\) and variance \({\sigma _{i}^{2}}\).
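Accordingly, sampling the visible units given h reduces to drawing each v_i from this Gaussian; a minimal NumPy sketch with the same illustrative names as above:

```python
# Illustrative sketch: visible conditional of the BG-RBM,
# v_i | h ~ N(a_i + sum_j W_ij h_j, sigma_i^2).
import numpy as np

def sample_visible(h, a, W, sigma, rng):
    mean = a + W @ h                     # per-unit Gaussian mean
    return rng.normal(loc=mean, scale=sigma)
```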

1.3 Derivative of log-likelihood function

$$\begin{array}{@{}rcl@{}} \frac{\partial\ln P(\textbf{v})}{\partial{W_{ij}}} & =& \frac{\partial\ln{\sum}_{\textbf{h}}e^{-E(\textbf{v},\textbf{h})} }{\partial{W_{ij}}}-\frac{\partial\ln{\sum}_{\textbf{h}}{\sum}_{\textbf{v}}e^{-E(\textbf{v},\textbf{h})} }{\partial{W_{ij}}} \\ & =& \frac{1}{{\sum}_{\textbf{h}}e^{-E(\textbf{v},\textbf{h})}} \left( {\sum}_{\textbf{h}}e^{-E(\textbf{v},\textbf{h})}\cdot\left( -\frac{\partial E(\textbf{v},\textbf{h})}{\partial{W_{ij}}}\right) \right) \\&&- \frac{1}{{\sum}_{\textbf{v},\textbf{h}}e^{-E(\textbf{v},\textbf{h})}} \left( {\sum}_{\textbf{v},\textbf{h}}e^{-E(\textbf{v},\textbf{h})}\cdot\left( -\frac{\partial E(\textbf{v},\textbf{h})}{\partial{W_{ij}}}\right) \right) \\ & =& - {\sum}_{\textbf{h}}\frac{e^{-E(\textbf{v},\textbf{h})}}{{\sum}_{\textbf{h}} e^{-E(\textbf{v},\textbf{h})}}\cdot\frac{\partial E(\textbf{v},\textbf{h})}{\partial{W_{ij}}} \\&&+ {\sum}_{\textbf{v},\textbf{h}}\frac{e^{-E(\textbf{v},\textbf{h})}}{{\sum}_{\textbf{v},\textbf{h}} e^{-E(\textbf{v},\textbf{h})}}\cdot\frac{\partial E(\textbf{v},\textbf{h})}{\partial{W_{ij}}} \\ & =& -{\sum}_{\textbf{h}} P(\textbf{h}|\textbf{v})\cdot\frac{\partial E(\textbf{v},\textbf{h})}{\partial{W_{ij}}} + {\sum}_{\textbf{v},\textbf{h}} P(\textbf{v},\textbf{h})\cdot\frac{\partial E(\textbf{v},\textbf{h})}{\partial{W_{ij}}} \end{array} $$
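For the BG-RBM energy function above, the derivative appearing in both terms is

$$\frac{\partial E\left( \textbf{v},\textbf{h}\right)_{BG}}{\partial{W_{ij}}} = -\frac{v_{i}}{{\sigma_{i}^{2}}}h_{j}, $$

which is substituted into the approximation below.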

1.4 Derivative of log-likelihood approximation

$$\begin{array}{@{}rcl@{}} \frac{\partial\ln P(\textbf{v})}{\partial{W_{ij}}} & \simeq& -\sum\limits_{\textbf{h}} P(\textbf{h}|\textbf{v}) \frac{\partial E(\textbf{v},\textbf{h})}{\partial{W_{ij}}} + {\sum}_{\tilde{\textbf{v}}} P(\tilde{\textbf{v}}) {\sum}_{\tilde{\textbf{h}}} P(\tilde{\textbf{h}}|\tilde{\textbf{v}}) \frac{\partial E(\tilde{\textbf{v}},\tilde{\textbf{h}})}{\partial{W_{ij}}} \\ & \simeq& \sum\limits_{\textbf{h}} P(\textbf{h}|\textbf{v})\frac{v_{i}h_{j}}{{\sigma_{i}^{2}}} - {\sum}_{\tilde{\textbf{v}}}P(\tilde{\textbf{v}}) {\sum}_{\tilde{\textbf{h}}} P(\tilde{\textbf{h}}|\tilde{\textbf{v}})\frac{\tilde{v}_{i}\tilde{h}_{j}}{{\sigma_{i}^{2}}} \\ & \simeq& P(h_{j} = 1|\textbf{v})\frac{v_{i}}{{\sigma_{i}^{2}}}-{\sum}_{\tilde{\textbf{v}}}P(\tilde{\textbf{v}})P(\tilde{h}_{j} = 1|\tilde{\textbf{v}})\frac{\tilde{v}_{i}}{{\sigma_{i}^{2}}} \end{array} $$

Applying the same procedure to the bias parameters, we obtain the following gradient estimates:

$$\begin{array}{@{}rcl@{}} \frac{\partial\ln P(\textbf{v})}{\partial{W_{ij}}} & \simeq& \frac{v_{i} h_{j} - \tilde{v}_{i} \tilde{h}_{j}}{{\sigma_{i}^{2}}} \\ \frac{\partial\ln P(\textbf{v})}{\partial{a_{i}}} & \simeq& \frac{v_{i} - \tilde{v}_{i} }{{\sigma_{i}^{2}}} \\ \frac{\partial\ln P(\textbf{v})}{\partial{b_{j}}} & \simeq& h_{j} - \tilde{h}_{j} \end{array} $$
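A minimal sketch of one CD-1 style parameter update built from these estimates (illustrative only, not the authors' implementation); it reuses the sample_hidden and sample_visible helpers sketched above, and lr is an assumed learning rate.

```python
# Illustrative sketch: one CD-1 update for the BG-RBM using the gradient
# estimates above, with the "tilde" samples produced by a single Gibbs step.
import numpy as np

def cd1_update(v0, a, b, W, sigma, lr, rng):
    # Positive phase: hidden sample driven by the data vector v0.
    h0 = sample_hidden(v0, b, W, sigma, rng)
    # Negative phase: one Gibbs step gives the reconstruction (tilde) samples.
    v1 = sample_visible(h0, a, W, sigma, rng)
    h1 = sample_hidden(v1, b, W, sigma, rng)

    inv_var = 1.0 / sigma ** 2
    W += lr * (np.outer(v0 * inv_var, h0) - np.outer(v1 * inv_var, h1))
    a += lr * (v0 - v1) * inv_var
    b += lr * (h0 - h1)
    return a, b, W
```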

Cite this article

Sanjaya, P., Kang, DK. Optimizing restricted Boltzmann machine learning by injecting Gaussian noise to likelihood gradient approximation. Appl Intell 49, 2723–2734 (2019). https://doi.org/10.1007/s10489-018-01400-5