
A distributed parallel training method of deep belief networks

  • Methodologies and Application
  • Published in Soft Computing

Abstract

Nowadays, it is well known that efficient training of deep neural networks plays a vital role in many successful applications. Training on a single computer is often impractical, especially when the model is large and high-performance computing resources are available. In this paper, we present a distributed parallel computing framework for training deep belief networks (DBNs) that exploits the power of high-performance clusters (i.e., systems consisting of many computers). Motivated by the greedy layer-wise learning algorithm of DBNs, the whole training process is divided layer by layer and distributed to different machines. At the same time, rough representations are exploited to parallelize the training process. Experiments on several large-scale real datasets show that the proposed algorithms significantly accelerate the training of DBNs while achieving better or competitive prediction accuracy compared with the original algorithm.
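The abstract outlines the scheme only at a high level. As a rough, hedged illustration (not the paper's implementation), the Python/NumPy sketch below shows the greedy layer-wise pipeline that the distribution strategy builds on: each layer's RBM is trained on the representation produced by the layer below, so once a layer has emitted its (possibly rough) representation, the next layer's training can be handed to a different machine. The CD-1 trainer, the function names, and the layer sizes are illustrative assumptions; the paper's cluster framework and rough-representation construction are not reproduced here.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden, n_epochs=10, lr=0.1):
    # Stand-in for one layer's RBM training using CD-1 (illustrative only).
    rng = np.random.default_rng(0)
    n_visible = data.shape[1]
    W = 0.01 * rng.standard_normal((n_visible, n_hidden))
    b, c = np.zeros(n_visible), np.zeros(n_hidden)
    for _ in range(n_epochs):
        for v0 in data:
            h0 = sigmoid(v0 @ W + c)                         # P(h = 1 | v0)
            h_samp = (rng.random(n_hidden) < h0).astype(float)
            v1 = sigmoid(h_samp @ W.T + b)                   # one-step reconstruction
            h1 = sigmoid(v1 @ W + c)
            W += lr * (np.outer(v0, h0) - np.outer(v1, h1))
            b += lr * (v0 - v1)
            c += lr * (h0 - h1)
    return W, c

def greedy_layerwise_pretrain(data, layer_sizes):
    # Train a stack of RBMs layer by layer; in a distributed setting each
    # train_rbm call could run on another machine once its input is available.
    params, rep = [], data
    for n_hidden in layer_sizes:
        W, c = train_rbm(rep, n_hidden)
        params.append((W, c))
        rep = sigmoid(rep @ W + c)                           # input for the next layer
    return params

For example, greedy_layerwise_pretrain(X, [500, 500]) trains two stacked RBMs in sequence on a binary data matrix X.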



Notes

  1. Tying all weight matrices together means that the weight matrix of each layer in the DBN is constrained to be equal. Taking the DBN shown in Fig. 1 as an example, tying all weight matrices to \(\mathbf {W}^1\) means setting \(\mathbf {W}^2=\mathbf {W}^3=\mathbf {W}^1\); a toy code sketch of this sharing appears after these notes.

  2. http://yann.lecun.com/exdb/mnist/.

  3. http://www.cs.nyu.edu/~ylclab/data/norb-v1.0-small/.

  4. http://qwone.com/~jason/20Newsgroups/.

  5. The authors are grateful to one anonymous reviewer for providing this insight.
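As referenced in Note 1, the toy NumPy snippet below (an illustrative assumption, with all layers given the same width so the matrices are conformable) shows what tying amounts to computationally: all layers share one parameter matrix, so an update to \(\mathbf {W}^1\) is simultaneously an update to \(\mathbf {W}^2\) and \(\mathbf {W}^3\).

import numpy as np

rng = np.random.default_rng(0)
n = 500                          # equal layer width assumed so W1, W2, W3 are conformable

# Untied: each layer owns an independent weight matrix.
untied = [rng.standard_normal((n, n)) for _ in range(3)]

# Tied, as in Note 1: W2 = W3 = W1, i.e. all three names refer to the same array,
# so any in-place update to W1 is seen by every layer.
W1 = rng.standard_normal((n, n))
W2 = W3 = W1
tied = [W1, W2, W3]

W1 += 0.01                       # in-place update to the shared matrix...
assert tied[1][0, 0] == W1[0, 0] and tied[2][0, 0] == W1[0, 0]   # ...visible in all layers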


Acknowledgements

The authors are very grateful to the editor and reviewers for their valuable comments, which greatly helped to improve the paper. This work is supported by the National Basic Research Program of China (973 Program, No. 2013CB329404), the National Natural Science Foundation of China (Nos. 61572393, 11501049, 11131006, 11671317) and the Special Program for Applied Research on Super Computation of the NSFC-Guangdong Joint Fund (the second phase).

Author information

Corresponding author

Correspondence to Jiangshe Zhang.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Human and animal rights

This article does not contain any studies with human participants or animals performed by any of the authors.

Additional information

Communicated by V. Loia.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix

A derivative of the log-likelihood

The derivative of the log-likelihood with respect to the model parameters \(\varvec{\theta }\) can be obtained from Eq. 2:

$$\begin{aligned} \begin{aligned} \frac{\partial \log P(\mathbf {v}_0;\varvec{\theta })}{\partial \varvec{\theta }}&=\frac{\partial \log Z_{\mathbf {v}_0}(\varvec{\theta })}{\partial \varvec{\theta }} - \frac{\partial \log Z(\varvec{\theta })}{\partial \varvec{\theta }},\\ Z_{\mathbf {v}_0}(\varvec{\theta })&= \sum _\mathbf {h}\exp (-E(\mathbf {v}_0,\mathbf {h})). \end{aligned} \end{aligned}$$
(8)

The first term in Eq. 8 is

$$\begin{aligned} \frac{\partial \log Z_{\mathbf {v}_0}(\varvec{\theta })}{\partial \varvec{\theta }}= & {} \frac{1}{Z_{\mathbf {v}_0}(\varvec{\theta })} \sum _\mathbf {h}\frac{\partial \exp (-E(\mathbf {v}_0,\mathbf {h}))}{\partial \varvec{\theta }}\nonumber \\= & {} - \frac{1}{Z_{\mathbf {v}_0}(\varvec{\theta })} \sum _\mathbf {h}\left( \exp (-E(\mathbf {v}_0,\mathbf {h})) \frac{\partial E(\mathbf {v}_0,\mathbf {h})}{\partial \varvec{\theta }} \right) \nonumber \\= & {} - \sum _\mathbf {h}\left( \frac{\exp (-E(\mathbf {v}_0,\mathbf {h}))}{Z_{\mathbf {v}_0}(\varvec{\theta })} \frac{\partial E(\mathbf {v}_0,\mathbf {h})}{\partial \varvec{\theta }} \right) \nonumber \\= & {} - \sum _\mathbf {h}\left( P(\mathbf {h}|\mathbf {v}_0) \frac{\partial E(\mathbf {v}_0,\mathbf {h})}{\partial \varvec{\theta }}\right) \nonumber \\= & {} - \mathbb {E}_{P(\mathbf {h}|\mathbf {v}_0)} \left[ \frac{\partial E(\mathbf {v}_0,\mathbf {h})}{\partial \varvec{\theta }} \right] , \end{aligned}$$
(9)

where \(P(\mathbf {h}|\mathbf {v}_0)\) is defined in Eq. 4. The second term in Eq. 8 is

$$\begin{aligned} \begin{aligned} \frac{\partial \log Z(\varvec{\theta })}{\partial \varvec{\theta }}&= \frac{1}{Z(\varvec{\theta })} \sum _{\mathbf {h},\mathbf {v}} \frac{\partial \exp (-E(\mathbf {v},\mathbf {h}))}{\partial \varvec{\theta }}\\&=- \frac{1}{Z(\varvec{\theta })} \sum _{\mathbf {h},\mathbf {v}} \left( \exp (-E(\mathbf {v},\mathbf {h})) \frac{\partial E(\mathbf {v},\mathbf {h})}{\partial \varvec{\theta }} \right) \\&=- \sum _{\mathbf {h},\mathbf {v}} \left( \frac{\exp (-E(\mathbf {v},\mathbf {h}))}{Z(\varvec{\theta })} \frac{\partial E(\mathbf {v},\mathbf {h})}{\partial \varvec{\theta }} \right) \\&=- \sum _{\mathbf {h},\mathbf {v}} \left( P(\mathbf {h},\mathbf {v}) \frac{\partial E(\mathbf {v},\mathbf {h})}{\partial \varvec{\theta }} \right) \\&=- \mathbb {E}_{P(\mathbf {v},\mathbf {h})} \left[ \frac{\partial E(\mathbf {v},\mathbf {h})}{\partial \varvec{\theta }} \right] . \end{aligned} \end{aligned}$$
(10)
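To make Eqs. 8–10 concrete, the sketch below (assuming the standard binary RBM energy \(E(\mathbf {v},\mathbf {h})=-\mathbf {v}^\top \mathbf {W}\mathbf {h}-\mathbf {b}^\top \mathbf {v}-\mathbf {c}^\top \mathbf {h}\); the paper's Eqs. 1–4 are not part of this excerpt) evaluates both expectations exactly by enumeration for a tiny model. Since \(-\partial E/\partial W_{ij}=v_ih_j\), the gradient with respect to the weights reduces to the data-dependent expectation of \(\mathbf {v}_0\mathbf {h}^\top\) (Eq. 9) minus the model expectation of \(\mathbf {v}\mathbf {h}^\top\) (Eq. 10).

import itertools
import numpy as np

def energy(v, h, W, b, c):
    # Standard binary RBM energy (assumed form): E(v, h) = -v^T W h - b^T v - c^T h.
    return -(v @ W @ h + b @ v + c @ h)

def exact_weight_gradient(v0, W, b, c):
    # Exact d log P(v0) / dW via Eqs. 8-10: the data-dependent expectation of v0 h^T
    # minus the model expectation of v h^T, both computed by brute-force enumeration.
    nv, nh = W.shape
    hs = [np.array(h, dtype=float) for h in itertools.product([0, 1], repeat=nh)]
    vs = [np.array(v, dtype=float) for v in itertools.product([0, 1], repeat=nv)]

    # Positive phase (Eq. 9): expectation under P(h | v0) = exp(-E(v0, h)) / Z_{v0}.
    w_pos = np.array([np.exp(-energy(v0, h, W, b, c)) for h in hs])
    p_pos = w_pos / w_pos.sum()
    pos = sum(p * np.outer(v0, h) for p, h in zip(p_pos, hs))

    # Negative phase (Eq. 10): expectation under the joint P(v, h) = exp(-E(v, h)) / Z.
    pairs = [(v, h) for v in vs for h in hs]
    w_neg = np.array([np.exp(-energy(v, h, W, b, c)) for v, h in pairs])
    p_neg = w_neg / w_neg.sum()
    neg = sum(p * np.outer(v, h) for p, (v, h) in zip(p_neg, pairs))

    return pos - neg

Enumerating all \(2^{n_v+n_h}\) configurations is feasible only for toy sizes; this intractability of the second term is exactly what contrastive divergence approximates with short Gibbs chains in standard RBM training.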


About this article


Cite this article

Shi, G., Zhang, J., Zhang, C. et al. A distributed parallel training method of deep belief networks. Soft Comput 24, 13357–13368 (2020). https://doi.org/10.1007/s00500-020-04754-6
