
Scalable Deep Learning on Distributed Infrastructures: Challenges, Techniques, and Tools

Published: 06 February 2020

Abstract

Deep Learning (DL) has seen immense success in the recent past, leading to state-of-the-art results in various domains such as image recognition and natural language processing. One reason for this success is the increasing size of DL models and the availability of vast amounts of training data. To keep improving the performance of DL, it is necessary to increase the scalability of DL systems. In this survey, we perform a broad and thorough investigation of the challenges, techniques, and tools for scalable DL on distributed infrastructures. This covers infrastructures for DL, methods for parallel DL training, multi-tenant resource scheduling, and the management of training and model data. Further, we analyze and compare 11 current open-source DL frameworks and tools, and investigate which of these techniques are commonly implemented in practice. Finally, we highlight future trends in DL systems that deserve further research.
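
To make one of the surveyed topics concrete, the sketch below illustrates synchronous data-parallel training, a widely used approach to parallel DL training: every worker holds a full model replica, computes gradients on its own shard of the mini-batch, and the gradients are averaged before all replicas apply the same update. This is a minimal NumPy illustration written for this summary, not code from the survey or from any of the frameworks it compares; the toy linear model, the shard layout, and the local_gradient helper are assumptions made purely for the example, and the gradient-averaging step stands in for the allreduce or parameter-server aggregation that real systems perform.

import numpy as np

rng = np.random.default_rng(0)

# Toy regression task: y = X @ w_true + noise.
n_samples, n_features, n_workers = 512, 8, 4
X = rng.normal(size=(n_samples, n_features))
w_true = rng.normal(size=n_features)
y = X @ w_true + 0.01 * rng.normal(size=n_samples)

# Each worker holds one shard of the training data (data parallelism).
shards = list(zip(np.array_split(X, n_workers), np.array_split(y, n_workers)))

# Every worker starts from an identical model replica.
w = np.zeros(n_features)
lr = 0.1

def local_gradient(w, X_shard, y_shard):
    """Mean-squared-error gradient computed on a worker's local shard."""
    residual = X_shard @ w - y_shard
    return 2.0 * X_shard.T @ residual / len(y_shard)

for step in range(100):
    # Each worker computes a gradient on its shard (would run in parallel).
    grads = [local_gradient(w, Xs, ys) for Xs, ys in shards]
    # Synchronous aggregation: average the gradients, as an allreduce would.
    avg_grad = np.mean(grads, axis=0)
    # All replicas apply the same update and therefore stay consistent.
    w -= lr * avg_grad

print("parameter error:", np.linalg.norm(w - w_true))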

Published In

ACM Computing Surveys, Volume 53, Issue 1 (January 2021), 781 pages
ISSN: 0360-0300; EISSN: 1557-7341
DOI: 10.1145/3382040

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

Published: 06 February 2020
Accepted: 01 September 2019
Revised: 01 July 2019
Received: 01 March 2019
Published in CSUR Volume 53, Issue 1


Author Tag

  1. Deep-learning systems

Qualifiers

  • Survey
  • Refereed

