Binary neural networks for speech recognition

Qian, Yan-min; Xiang, Xu

doi:10.1631/FITEE.1800469

Abstract

Recently, deep neural networks (DNNs) have significantly outperformed Gaussian mixture models in acoustic modeling for speech recognition. However, the substantial increase in computational load during the inference stage makes deep models difficult to deploy directly on low-power embedded devices. To alleviate this issue, structure sparseness and low-precision fixed-point quantization have been widely applied. In this work, binary neural networks for speech recognition are developed to reduce the computational cost during the inference stage. A fast implementation of binary matrix multiplication is introduced. On modern central processing unit (CPU) and graphics processing unit (GPU) architectures, a 5–7 times speedup over full-precision floating-point matrix multiplication can be achieved in real applications. Several kinds of binary neural networks and related model optimization algorithms are developed for large vocabulary continuous speech recognition acoustic modeling. In addition, to improve the accuracy of binary models, knowledge distillation from the normal full-precision floating-point model to the compressed binary model is explored. Experiments on the standard Switchboard speech recognition task show that the proposed binary neural networks can deliver a 3–4 times speedup over the normal full-precision deep models. With knowledge distillation from the normal floating-point models, the binary DNNs and binary convolutional neural networks (CNNs) can restrict the word error rate (WER) degradation to within 15.0%, compared to the normal full-precision floating-point DNNs and CNNs, respectively. In particular, for the binary CNN with binarization only on the convolutional layers, the WER degradation with the proposed approach is very small, almost negligible.
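The abstract does not describe how the fast binary matrix multiplication is implemented. As a rough illustration only, the sketch below uses the XNOR-and-popcount formulation that is common in the binarized-network literature: values are mapped to +1 or −1 by sign, packed into machine words, and a dot product is recovered from the number of positions where the signs agree. All function names here (`binarize`, `pack_bits`, `binary_dot`, `binary_matvec`) are hypothetical, and this is not claimed to be the authors' implementation.

```python
# Minimal sketch, assuming the common XNOR + popcount formulation.
# All names are illustrative; this is not the paper's code.

def binarize(values):
    """Map real values to {+1, -1} by sign (zero is mapped to +1)."""
    return [1 if v >= 0 else -1 for v in values]


def pack_bits(signs):
    """Pack a {+1, -1} vector into an int: bit i is set where signs[i] == +1."""
    word = 0
    for i, s in enumerate(signs):
        if s == 1:
            word |= 1 << i
    return word


def binary_dot(a_bits, b_bits, n):
    """Dot product of two packed {+1, -1} vectors of length n.

    XNOR marks the positions where the signs agree. Each agreement
    contributes +1 and each disagreement -1, so dot = 2 * agreements - n.
    """
    mask = (1 << n) - 1                       # keep only the n valid bits
    agreements = bin(~(a_bits ^ b_bits) & mask).count("1")
    return 2 * agreements - n


def binary_matvec(weight_rows, x):
    """Binarize a weight matrix and an input vector, then multiply them."""
    n = len(x)
    x_bits = pack_bits(binarize(x))
    return [binary_dot(pack_bits(binarize(row)), x_bits, n)
            for row in weight_rows]


if __name__ == "__main__":
    W = [[0.5, -1.2, 0.3, -0.1],
         [-0.7, -0.2, 0.9, 0.4]]
    x = [1.0, -2.0, 0.5, -0.25]
    print(binary_matvec(W, x))  # [4, 0]
```

In a production kernel the packing would be done once per weight matrix, and the bit count would map to a hardware population-count instruction operating on 32- or 64-bit words, which is where the savings over floating-point multiply-accumulate come from.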

Author information

Authors and Affiliations

Key Laboratory of Shanghai Education Commission for Intelligent Interaction and Cognitive Engineering, Shanghai Jiao Tong University, Shanghai, 200240, China
Yan-min Qian & Xu Xiang
SpeechLab, Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, 200240, China
Yan-min Qian & Xu Xiang

Corresponding author

Correspondence to Yan-min Qian.

Additional information

Project supported by the National Natural Science Foundation of China (Nos. 61603252 and U1736202). Experiments were carried out on the Pi supercomputer at Shanghai Jiao Tong University.

A preliminary version was presented at the 18th Annual Conference of the International Speech Communication Association, August 20–24, 2017, Sweden.

About this article

Cite this article

Qian, Ym., Xiang, X. Binary neural networks for speech recognition. Frontiers Inf Technol Electronic Eng 20, 701–715 (2019). https://doi.org/10.1631/FITEE.1800469

Received: 26 August 2018
Accepted: 23 December 2018
Published: 18 June 2019
Issue Date: May 2019
DOI: https://doi.org/10.1631/FITEE.1800469

Key words

CLC number

TP391.4
