Improving the Decoding Efficiency of Deep Neural Network Acoustic Models by Cluster-Based Senone Selection

Journal of Signal Processing Systems

Abstract

In this paper, we propose a cluster-based senone selection method to speed up the computation of deep neural networks (DNNs) during decoding in automatic speech recognition (ASR) systems. In DNN-based acoustic models, the large number of senones at the output layer is one of the main causes of the high computational complexity of DNNs. Inspired by the mixture selection method designed for Gaussian mixture model (GMM)-based acoustic models, our proposed method selects only a subset of the senones at the output layer of the DNN when calculating posterior probabilities. The senone selection strategy is derived by clustering acoustic features according to their transformed representations at the top hidden layer of the DNN acoustic model. Experimental results on Mandarin speech recognition tasks show that the proposed method reduces the average number of DNN parameters used for computation by 22% and accelerates the overall recognition process by 13% without significant performance degradation. Experimental results on the Switchboard task show that it reduces the average number of DNN parameters used for computation by 38.8% for conventional DNN modeling and by 22.7% for low-rank DNN modeling, with negligible performance loss.
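The selection idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract does not say how clusters are mapped to senone subsets or how unselected senones are scored, so the nearest-centroid rule, the per-cluster senone lists (`cluster_senones`), and the constant `floor` value below are all assumptions made for illustration, as are the function names and the toy numbers.

```python
import math

def nearest_cluster(h, centroids):
    """Index of the centroid closest to h (squared Euclidean distance).

    Assumption: clusters are represented by centroids in the space of
    top-hidden-layer activations.
    """
    def sqdist(c):
        return sum((a - b) ** 2 for a, b in zip(h, c))
    return min(range(len(centroids)), key=lambda k: sqdist(centroids[k]))

def selected_posteriors(h, W, b, centroids, cluster_senones, floor=1e-8):
    """Softmax over only the senones selected for h's cluster.

    W[j] and b[j] are the output-layer weights and bias of senone j.
    Unselected senones receive `floor` (an assumption of this sketch).
    Returns (posteriors, number of output rows actually evaluated).
    """
    k = nearest_cluster(h, centroids)
    chosen = cluster_senones[k]
    # Only the rows of W belonging to the selected senones are multiplied.
    logits = {j: sum(w * x for w, x in zip(W[j], h)) + b[j] for j in chosen}
    m = max(logits.values())  # subtract the max for numerical stability
    exps = {j: math.exp(v - m) for j, v in logits.items()}
    z = sum(exps.values())
    post = [floor] * len(W)
    for j, e in exps.items():
        post[j] = e / z
    return post, len(chosen)

# Toy example: 2-dimensional hidden vector, 4 senones, 2 clusters.
h = [1.0, 0.0]
W = [[2.0, 0.0], [1.0, 0.0], [0.0, 3.0], [0.0, 1.0]]
b = [0.0, 0.0, 0.0, 0.0]
centroids = [[1.0, 0.0], [0.0, 1.0]]
cluster_senones = [[0, 1], [2, 3]]  # hypothetical cluster-to-senone table
post, used = selected_posteriors(h, W, b, centroids, cluster_senones)
```

In this toy setup only two of the four output rows are evaluated for the frame, which is the source of the savings: the cost of the output layer scales with the size of the selected subset rather than with the full senone inventory.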

Acknowledgements

This work was partly supported by the National Key Research and Development Program (Grant No. 2016YFB1001300), the Fundamental Research Funds for the Central Universities (Grant No. WK2350000001), and the CAS Strategic Priority Research Program (Grant No. XDB02070006).

Author information

Corresponding author

Correspondence to Jun-Hua Liu.

Cite this article

Liu, JH., Ling, ZH., Wei, S. et al. Improving the Decoding Efficiency of Deep Neural Network Acoustic Models by Cluster-Based Senone Selection. J Sign Process Syst 90, 999–1011 (2018). https://doi.org/10.1007/s11265-017-1288-9

  • DOI: https://doi.org/10.1007/s11265-017-1288-9
