Abstract
Deep neural networks (DNNs) have recently outperformed Gaussian mixture models by a significant margin in acoustic modeling for speech recognition. However, the substantial increase in computational load during inference makes deep models difficult to deploy directly on low-power embedded devices. To alleviate this issue, structure sparseness and low-precision fixed-point quantization have been widely applied. In this work, binary neural networks for speech recognition are developed to reduce the computational cost of inference. A fast implementation of binary matrix multiplication is introduced. On modern central processing unit (CPU) and graphics processing unit (GPU) architectures, it achieves a 5–7 times speedup over full-precision floating-point matrix multiplication in real applications. Several kinds of binary neural networks and related model optimization algorithms are developed for large vocabulary continuous speech recognition acoustic modeling. In addition, to improve the accuracy of binary models, knowledge distillation from the normal full-precision floating-point model to the compressed binary model is explored. Experiments on the standard Switchboard speech recognition task show that the proposed binary neural networks deliver a 3–4 times speedup over the normal full-precision deep models. With knowledge distillation from the normal floating-point models, the binary DNNs and binary convolutional neural networks (CNNs) restrict the word error rate (WER) degradation to within 15.0% compared with the normal full-precision floating-point DNNs and CNNs, respectively. In particular, for the binary CNN with binarization applied only to the convolutional layers, the WER degradation with the proposed approach is almost negligible.
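The abstract does not describe how the fast binary matrix multiplication is implemented, beyond listing population count among the key words. As a rough illustration only, the sketch below uses the standard identity for vectors with entries in {+1, −1}: if d positions differ, the dot product of two length-n vectors equals n − 2d, and d is the population count of the XOR of their bit-packed forms. The function names (`pack_bits`, `popcount`, `binary_dot`, `binary_matmul`) are hypothetical and chosen for this example; this is not the authors' implementation, and it makes no attempt to reproduce the reported speedups.

```python
def pack_bits(values):
    """Pack a sequence of +1/-1 values into an int; bit i is 1 when values[i] == +1."""
    word = 0
    for i, v in enumerate(values):
        if v not in (1, -1):
            raise ValueError("binary vectors must contain only +1 or -1")
        if v == 1:
            word |= 1 << i
    return word


def popcount(word):
    """Count set bits (a portable stand-in for a hardware population-count instruction)."""
    return bin(word).count("1")


def binary_dot(a_bits, b_bits, n):
    """Dot product of two length-n +1/-1 vectors given in packed form.

    Matching positions contribute +1 and differing positions -1, so the
    result is (n - d) - d = n - 2*d, where d = popcount(a XOR b).
    """
    return n - 2 * popcount(a_bits ^ b_bits)


def binary_matmul(A, B):
    """Multiply A (m x n) by B (n x p), both with entries in {+1, -1}."""
    n = len(B)
    if any(len(row) != n for row in A):
        raise ValueError("inner dimensions of A and B do not match")
    rows = [pack_bits(row) for row in A]       # pack each row of A
    cols = [pack_bits(col) for col in zip(*B)]  # pack each column of B
    return [[binary_dot(r, c, n) for c in cols] for r in rows]


if __name__ == "__main__":
    A = [[1, -1, 1], [-1, -1, 1]]
    B = [[1, 1], [-1, 1], [-1, -1]]
    print(binary_matmul(A, B))
```

In a performance-oriented version, the packed words would be fixed-width machine integers and the bit count would map to a hardware instruction; the pure-Python integers here only demonstrate the arithmetic.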
Additional information
Project supported by the National Natural Science Foundation of China (Nos. 61603252 and U1736202). Experiments were carried out on the Pi supercomputer at Shanghai Jiao Tong University.
A preliminary version was presented at the 18th Annual Conference of the International Speech Communication Association, August 20–24, 2017, Sweden.
About this article
Cite this article
Qian, Ym., Xiang, X. Binary neural networks for speech recognition. Frontiers Inf Technol Electronic Eng 20, 701–715 (2019). https://doi.org/10.1631/FITEE.1800469
DOI: https://doi.org/10.1631/FITEE.1800469
Key words
- Speech recognition
- Binary neural networks
- Binary matrix multiplication
- Knowledge distillation
- Population count