Abstract
Efficient training of deep neural networks is well known to play a vital role in many successful applications. Training on a single computer is often impractical, especially when the models are large and high-performance computing resources are available. In this paper, we present a distributed parallel computing framework for training deep belief networks (DBNs) that exploits the power of high-performance clusters (i.e., systems consisting of many computers). Motivated by the greedy layer-wise learning algorithm of DBNs, the whole training process is divided layer by layer and distributed across different machines. At the same time, rough representations are exploited to parallelize the training process. Experiments on several large-scale real datasets show that the proposed algorithms significantly accelerate DBN training while achieving better or competitive prediction accuracy compared with the original algorithm.
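As a rough, purely illustrative sketch of this layer-by-layer division (not the authors' implementation; the train_rbm and propagate_up routines, the layer sizes, and the sequential simulation of the cluster are all hypothetical), greedy layer-wise pretraining of a DBN can be organized so that each layer's restricted Boltzmann machine is trained separately and its hidden representations are handed to the machine responsible for the next layer:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden, epochs=2, lr=0.01):
    """Train one RBM layer with one-step contrastive divergence (CD-1).
    Mean-field probabilities are used instead of binary samples for brevity."""
    n_visible = data.shape[1]
    W = 0.01 * np.random.randn(n_visible, n_hidden)
    b, c = np.zeros(n_visible), np.zeros(n_hidden)
    for _ in range(epochs):
        for v0 in data:
            h0 = sigmoid(v0 @ W + c)        # positive phase
            v1 = sigmoid(h0 @ W.T + b)      # reconstruction
            h1 = sigmoid(v1 @ W + c)        # negative phase
            W += lr * (np.outer(v0, h0) - np.outer(v1, h1))
            b += lr * (v0 - v1)
            c += lr * (h0 - h1)
    return W, b, c

def propagate_up(data, W, c):
    """Hidden representations that become the next layer's training data."""
    return sigmoid(data @ W + c)

# Greedy layer-wise pretraining: in a cluster setting each iteration of
# this loop would run on a different machine, receiving `data` from the
# previous one; here it is simulated sequentially on random binary data.
data = (np.random.rand(100, 784) > 0.5).astype(float)
params = []
for n_hidden in [500, 500, 200]:
    W, b, c = train_rbm(data, n_hidden)
    params.append((W, b, c))
    data = propagate_up(data, W, c)

In a real cluster, each loop iteration would run on a different node, with the propagated representations transferred over the network rather than kept in local memory.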
Notes
Tying all weight matrices together means that the weight matrices of all layers in the DBN are constrained to be equal. Taking the DBN shown in Fig. 1 as an example, tying all weight matrices to \(\mathbf {W}^1\) means setting \(\mathbf {W}^2=\mathbf {W}^3=\mathbf {W}^1\).
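As a purely illustrative sketch (the layer size and variable names are hypothetical, and exact tying requires the layers to have equal dimensions), tying in code simply means that every layer refers to the same weight array rather than to a copy:

import numpy as np

# Hypothetical DBN with three 500-unit layers so that one square matrix
# can be shared. W2 and W3 are the same object as W1, not copies:
# any update applied to W1 is immediately seen by every layer.
W1 = 0.01 * np.random.randn(500, 500)
W2 = W1
W3 = W1
assert W2 is W1 and W3 is W1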
The authors are grateful to an anonymous reviewer for providing this insight.
Acknowledgements
The authors are very grateful to the editor and reviewers for their valuable comments, which greatly helped to improve the paper. This work is supported by the National Basic Research Program of China (973 Program, No. 2013CB329404), the National Natural Science Foundation of China (Nos. 61572393, 11501049, 11131006, 11671317) and the Special Program for Applied Research on Super Computation of the NSFC-Guangdong Joint Fund (the second phase).
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Human and animal rights
This article does not contain any studies with human participants or animals performed by any of the authors.
Additional information
Communicated by V. Loia.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendix
A. Derivative of the log-likelihood
The derivative of the log-likelihood with respect to the model parameters \(\varvec{\theta }\) can be obtained from Eq. 2:
The first term in Eq. 8 is
where \(P(\mathbf {h}|\mathbf {v}_0)\) is defined in Eq. 4. The second term in Eq. 8 is
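In the standard restricted Boltzmann machine setting, this decomposition takes the following well-known form (a sketch for the reader, assuming Eq. 2 is the log-likelihood \(\ln \mathcal {L}(\varvec{\theta }|\mathbf {v}_0)=\ln \sum _{\mathbf {h}}e^{-E(\mathbf {v}_0,\mathbf {h})}-\ln \sum _{\mathbf {v},\mathbf {h}}e^{-E(\mathbf {v},\mathbf {h})}\) for an energy function \(E\)):

\[
\frac{\partial \ln \mathcal {L}(\varvec{\theta }|\mathbf {v}_0)}{\partial \varvec{\theta }}
 = \frac{\partial }{\partial \varvec{\theta }}\ln \sum _{\mathbf {h}}e^{-E(\mathbf {v}_0,\mathbf {h})}
 - \frac{\partial }{\partial \varvec{\theta }}\ln \sum _{\mathbf {v},\mathbf {h}}e^{-E(\mathbf {v},\mathbf {h})},
\]

whose first term evaluates to

\[
-\sum _{\mathbf {h}}P(\mathbf {h}|\mathbf {v}_0)\,\frac{\partial E(\mathbf {v}_0,\mathbf {h})}{\partial \varvec{\theta }},
\]

and whose second term evaluates to

\[
-\sum _{\mathbf {v}}P(\mathbf {v})\sum _{\mathbf {h}}P(\mathbf {h}|\mathbf {v})\,\frac{\partial E(\mathbf {v},\mathbf {h})}{\partial \varvec{\theta }}.
\]

The first expectation is taken under the tractable data-dependent conditional \(P(\mathbf {h}|\mathbf {v}_0)\), whereas the second is an expectation under the model distribution, which contrastive divergence approximates with a short Gibbs chain.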
About this article
Cite this article
Shi, G., Zhang, J., Zhang, C. et al. A distributed parallel training method of deep belief networks. Soft Comput 24, 13357–13368 (2020). https://doi.org/10.1007/s00500-020-04754-6