DOI: 10.1145/3146347.3146351

Accelerating deep neural network learning for speech recognition on a cluster of GPUs

Published: 12 November 2017

Abstract

We train deep neural networks to solve the acoustic modeling problem for large-vocabulary continuous speech recognition, employing distributed processing on a cluster of GPUs. Even on modern GPUs, the sequential implementation takes over a day to train, and efficient parallelization without losing accuracy is notoriously hard. We show that asynchronous SGD (ASGD) methods are not efficient for this application: even with 4 GPUs, the overhead is significant and the accuracies achieved are poor. We adapt a P-learner K-step model averaging algorithm that, with 4 GPUs, achieves accuracies comparable to those of the sequential implementation. We further introduce adaptive measures that make our parallel implementation scale to the full cluster of 20 GPUs. Ultimately, our parallel implementation achieves better accuracies than the sequential implementation, with a 6.1x speedup.
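
The heart of the method can be sketched directly: in K-step model averaging, each of the P learners runs K local SGD steps on its own shard of the data, and the replicas are then synchronized by averaging their parameter vectors, rather than exchanging gradients after every mini-batch. The following is a minimal, hypothetical sketch of that pattern, assuming mpi4py and NumPy and using a toy least-squares objective in place of the acoustic model; it is not the authors' implementation, and the adaptive measures that scale the method to 20 GPUs are not shown.

```python
# Hypothetical sketch of P-learner K-step model averaging (local SGD with
# periodic parameter averaging). Assumed stack: mpi4py + NumPy; the toy
# least-squares objective stands in for the acoustic model.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
P = comm.Get_size()                      # number of learners
rank = comm.Get_rank()

K, ROUNDS, LR, BATCH = 8, 50, 0.01, 32   # illustrative hyperparameters
rng = np.random.default_rng(seed=rank)   # each learner draws its own shard
w_true = np.ones(10)                     # optimum of the toy problem
w = np.zeros(10)                         # this learner's model replica

for _ in range(ROUNDS):
    # K independent local steps; the P replicas drift apart in this phase.
    for _ in range(K):
        X = rng.standard_normal((BATCH, 10))
        y = X @ w_true
        grad = X.T @ (X @ w - y) / BATCH   # least-squares gradient
        w -= LR * grad
    # Synchronize: sum the P replicas with one allreduce, then average.
    w_sum = np.empty_like(w)
    comm.Allreduce(w, w_sum, op=MPI.SUM)
    w = w_sum / P

if rank == 0:
    print("distance to optimum:", np.linalg.norm(w - w_true))
```

Run with, e.g., `mpirun -np 4 python kstep_averaging.py` (the file name is illustrative). Unlike ASGD, where each learner pushes gradients to shared parameters as soon as they are computed, this pattern communicates only once every K steps, so K trades communication cost against divergence between the replicas.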


Published In

MLHPC'17: Proceedings of the Machine Learning on HPC Environments
November 2017
81 pages
ISBN: 9781450351379
DOI: 10.1145/3146347

Publisher

Association for Computing Machinery, New York, NY, United States



Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

SC '17
Acceptance Rates

Overall Acceptance Rate 5 of 7 submissions, 71%


Cited By

  • Fast Training of Deep Neural Networks for Speech Recognition. ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6884-6888, May 2020. DOI: 10.1109/ICASSP40776.2020.9053993
  • Speech Recognition Using Deep Neural Networks: A Systematic Review. IEEE Access, 7:19143-19165, 2019. DOI: 10.1109/ACCESS.2019.2896880
  • Optimized pulsed write schemes improve linearity and write speed for low-power organic neuromorphic devices. Journal of Physics D: Applied Physics, 51(22):224002, May 2018. DOI: 10.1088/1361-6463/aabe70
  • Comparative Study of Distributed Deep Learning Tools on Supercomputers. Algorithms and Architectures for Parallel Processing, pages 122-137, December 2018. DOI: 10.1007/978-3-030-05051-1_9
