DOI: 10.1145/3146347.3146351

Accelerating deep neural network learning for speech recognition on a cluster of GPUs

Published: 12 November 2017

Abstract

We train deep neural networks to solve the acoustic modeling problem for large-vocabulary continuous speech recognition, employing distributed processing on a cluster of GPUs. Even on modern GPUs, the sequential implementation takes over a day to train, and efficient parallelization without losing accuracy is notoriously hard. We show that asynchronous SGD (ASGD) methods are not efficient for this application: even with 4 GPUs, the overhead is significant and the accuracies achieved are poor. We adapt a P-learner K-step model averaging algorithm that, with 4 GPUs, achieves accuracies comparable to those of the sequential implementation. We further introduce adaptive measures that make our parallel implementation scale to the full cluster of 20 GPUs. Ultimately, our parallel implementation achieves better accuracies than the sequential implementation, with a 6.1x speedup.
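
The heart of the method can be sketched directly: in K-step model averaging, each of the P learners runs K local SGD steps on its own shard of the data, and the replicas are then synchronized by averaging their parameter vectors, rather than exchanging gradients after every mini-batch. The following is a minimal, hypothetical sketch of that pattern, assuming mpi4py and NumPy and using a toy least-squares objective in place of the acoustic model; it is not the authors' implementation, and the adaptive measures that scale the method to 20 GPUs are not shown.

```python
# Hypothetical sketch of P-learner K-step model averaging (local SGD with
# periodic parameter averaging). Assumed stack: mpi4py + NumPy; the toy
# least-squares objective stands in for the acoustic model.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
P = comm.Get_size()                      # number of learners
rank = comm.Get_rank()

K, ROUNDS, LR, BATCH = 8, 50, 0.01, 32   # illustrative hyperparameters
rng = np.random.default_rng(seed=rank)   # each learner draws its own shard
w_true = np.ones(10)                     # optimum of the toy problem
w = np.zeros(10)                         # this learner's model replica

for _ in range(ROUNDS):
    # K independent local steps; the P replicas drift apart in this phase.
    for _ in range(K):
        X = rng.standard_normal((BATCH, 10))
        y = X @ w_true
        grad = X.T @ (X @ w - y) / BATCH   # least-squares gradient
        w -= LR * grad
    # Synchronize: sum the P replicas with one allreduce, then average.
    w_sum = np.empty_like(w)
    comm.Allreduce(w, w_sum, op=MPI.SUM)
    w = w_sum / P

if rank == 0:
    print("distance to optimum:", np.linalg.norm(w - w_true))
```

Run with, e.g., `mpirun -np 4 python kstep_averaging.py` (the file name is illustrative). Unlike ASGD, where each learner pushes gradients to shared parameters as soon as they are computed, this pattern communicates only once every K steps, so K trades communication cost against divergence between the replicas.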


Published In

MLHPC'17: Proceedings of the Machine Learning on HPC Environments
November 2017
81 pages
ISBN: 9781450351379
DOI: 10.1145/3146347

Publisher

Association for Computing Machinery, New York, NY, United States



Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

SC '17
Acceptance Rates

Overall Acceptance Rate 5 of 7 submissions, 71%


Cited By

  • Fast Training of Deep Neural Networks for Speech Recognition. ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6884-6888, May 2020. DOI: 10.1109/ICASSP40776.2020.9053993
  • Speech Recognition Using Deep Neural Networks: A Systematic Review. IEEE Access, 7:19143-19165, 2019. DOI: 10.1109/ACCESS.2019.2896880
  • Optimized pulsed write schemes improve linearity and write speed for low-power organic neuromorphic devices. Journal of Physics D: Applied Physics, 51(22):224002, May 2018. DOI: 10.1088/1361-6463/aabe70
  • Comparative Study of Distributed Deep Learning Tools on Supercomputers. Algorithms and Architectures for Parallel Processing, pages 122-137, December 2018. DOI: 10.1007/978-3-030-05051-1_9
