Abstract

In recent years, deep neural networks (DNNs) have significantly outperformed Gaussian mixture models in acoustic modeling for speech recognition. However, the substantial increase in computational load at the inference stage makes deep models difficult to deploy directly on low-power embedded devices. To alleviate this issue, structure sparseness and low-precision fixed-point quantization have been widely applied. In this work, binary neural networks for speech recognition are developed to reduce the computational cost of inference. A fast implementation of binary matrix multiplication is introduced. On modern central processing unit (CPU) and graphics processing unit (GPU) architectures, it achieves a speedup of 5–7 times over full-precision floating-point matrix multiplication in real applications. Several kinds of binary neural networks and related model optimization algorithms are developed for acoustic modeling in large vocabulary continuous speech recognition. In addition, to improve the accuracy of binary models, knowledge distillation from the normal full-precision floating-point model to the compressed binary model is explored. Experiments on the standard Switchboard speech recognition task show that the proposed binary neural networks deliver a speedup of 3–4 times over the normal full-precision deep models. With knowledge distillation from the normal floating-point models, the binary DNNs and binary convolutional neural networks (CNNs) restrict the word error rate (WER) degradation to within 15.0% compared with the normal full-precision floating-point DNNs and CNNs, respectively. In particular, for the binary CNN with binarization only on the convolutional layers, the WER degradation under the proposed approach is almost negligible.
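The abstract does not describe how the fast binary matrix multiplication is implemented. As a rough illustration only, the sketch below uses the XNOR-and-popcount identity commonly used for vectors with entries in {+1, −1}: if two length-n vectors agree in m positions, their dot product is 2m − n. The function names (`pack_bits`, `binary_dot`, `binary_matmul`) are hypothetical, and this plain-Python version shows the arithmetic rather than the optimized CPU/GPU kernels the paper reports.

```python
def pack_bits(signs):
    """Pack a sequence of +1/-1 values into one integer (bit i set when signs[i] == +1)."""
    word = 0
    for i, s in enumerate(signs):
        if s not in (1, -1):
            raise ValueError("entries must be +1 or -1")
        if s == 1:
            word |= 1 << i
    return word


def binary_dot(a_bits, b_bits, n):
    """Dot product of two length-n +1/-1 vectors, given their packed bit patterns."""
    mask = (1 << n) - 1
    agree = ~(a_bits ^ b_bits) & mask   # XNOR: 1 where the two signs match
    matches = bin(agree).count("1")     # popcount
    return 2 * matches - n              # matches minus mismatches


def binary_matmul(A, B):
    """Multiply A (m x n) by B (n x p), both given as lists of lists of +1/-1."""
    n = len(B)
    if any(len(row) != n for row in A):
        raise ValueError("inner dimensions do not match")
    rows = [pack_bits(row) for row in A]
    cols = [pack_bits([B[k][j] for k in range(n)]) for j in range(len(B[0]))]
    return [[binary_dot(r, c, n) for c in cols] for r in rows]


if __name__ == "__main__":
    A = [[1, -1, 1], [-1, -1, 1]]
    B = [[1, 1], [-1, 1], [1, -1]]
    print(binary_matmul(A, B))
```

Each multiply-accumulate over n elements collapses into one XOR, one bitwise NOT, and one population count per machine word, which is where implementations of this kind typically get their speedup over floating-point arithmetic.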

Author information

Corresponding author

Correspondence to Yan-min Qian.

Additional information

Project supported by the National Natural Science Foundation of China (Nos. 61603252 and U1736202). Experiments were carried out on the Pi supercomputer at Shanghai Jiao Tong University.

A preliminary version was presented at the 18th Annual Conference of the International Speech Communication Association, August 20–24, 2017, Sweden.

About this article

Cite this article

Qian, Ym., Xiang, X. Binary neural networks for speech recognition. Frontiers Inf Technol Electronic Eng 20, 701–715 (2019). https://doi.org/10.1631/FITEE.1800469

  • DOI: https://doi.org/10.1631/FITEE.1800469
