Abstract
Parameter-server frameworks play an important role in scaling up distributed deep learning algorithms. However, the constant growth of neural network size has created a serious bottleneck in exchanging parameters across machines. Recent efforts reduce communication overhead by manually setting a parameter-exchanging interval, without regard to the parameter server's resource availability. An inappropriate interval can lead to poor performance or inaccurate results. Moreover, request bursts may occur, further exacerbating the bottleneck.
In this paper, we propose an approach that automatically sets the optimal exchanging interval, aiming to remove the parameter-exchanging bottleneck and to utilize resources evenly without losing training accuracy. The key idea is to adjust the interval on each training node based on knowledge of the available resources, and to assign a different interval to each slave node so as to avoid request bursts. We applied this method to optimize the parallel Stochastic Gradient Descent algorithm, speeding up the parameter-exchanging process by a factor of eight.
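To make the interval-staggering idea concrete, the following is a minimal sketch of how per-slave exchange intervals could be assigned so that requests to the parameter server de-synchronize rather than arrive in bursts. The function names, the base interval, and the staggering rule are illustrative assumptions for exposition, not the authors' exact policy.

```python
# Hypothetical sketch: give each slave node a slightly different
# parameter-exchanging interval so their push/pull requests to the
# parameter server spread out over time instead of bursting.

def staggered_intervals(num_slaves, base_interval, spread=1):
    """Return a per-slave exchange interval (in training iterations).

    Slaves cycle through (spread + 1) distinct intervals around the
    base, so no large group of slaves exchanges at the same moment.
    """
    return [base_interval + (i % (spread + 1)) for i in range(num_slaves)]

def exchange_iterations(interval, num_iters):
    """Iterations at which a slave with the given interval exchanges."""
    return [t for t in range(1, num_iters + 1) if t % interval == 0]

# Four slaves, base interval of 10 iterations, spread of 3:
intervals = staggered_intervals(num_slaves=4, base_interval=10, spread=3)
# -> [10, 11, 12, 13]: each slave exchanges at a distinct cadence,
# so after the first round their requests no longer coincide.
```

In a real deployment the base interval would itself be chosen from the server's measured resource availability, as the paper proposes; this sketch only shows the burst-avoidance staggering.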
Acknowledgments
This work is supported by the National High-tech Research and Development Program of China (863 Program) under grant No. 2015AA015303, the National Natural Science Foundation of China under grants No. 61322210, 61272408, and 61433019, and the Doctoral Fund of the Ministry of Education of China under grant No. 20130142110048.
Cite this article
Wang, S., Liao, X., Fan, X. et al. Automatically Setting Parameter-Exchanging Interval for Deep Learning. Mobile Netw Appl 22, 186–194 (2017). https://doi.org/10.1007/s11036-016-0740-6