Deep Neural Network Quantizers Outperforming Continuous Speech Recognition Systems

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11658)

Abstract

In Automatic Speech Recognition (ASR), the acoustic model (AM) is typically modeled by a Deep Neural Network (DNN). The DNN learns a posterior probability in a supervised fashion from input features and ground-truth labels. Current approaches combine a DNN with a Hidden Markov Model (HMM) in a hybrid system, which has achieved good results in recent years. Similar approaches using a discrete version, i.e. a Discrete Hidden Markov Model (DHMM), have been disregarded in the recent past. Our approach revisits the idea of a discrete system, more precisely the so-called Deep Neural Network Quantizer (DNNQ), and demonstrates how a DNNQ is created and trained. We introduce a novel approach to train a DNNQ in a supervised fashion with an arbitrary output layer size even though suitable target values are not available. The proposed method provides a mapping function that exploits fixed ground-truth labels. Consequently, we are able to apply frame-based cross-entropy (CE) training. Our experiments demonstrate that the DNNQ reduces the Word Error Rate (WER) by 17.6% on monophones and by 2.2% on triphones compared to a continuous HMM-Gaussian Mixture Model (GMM) system.
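The abstract's core idea is that a network with an arbitrary output layer size is trained with frame-based cross entropy against targets derived via a mapping from ground-truth labels, and each frame is then quantized to its argmax output codeword. The following is a minimal NumPy sketch of such a training loop; the single-layer architecture, all sizes, the learning rate, and the randomly generated "mapped" labels are illustrative assumptions, since the paper's actual DNN topology and mapping function are not given here.

```python
import numpy as np

rng = np.random.default_rng(0)

n_frames, n_feats, n_codewords = 32, 13, 8    # hypothetical sizes

X = rng.normal(size=(n_frames, n_feats))         # acoustic feature frames
labels = rng.integers(0, n_codewords, n_frames)  # targets from the (assumed) label mapping

# single linear layer with softmax output of arbitrary size n_codewords
W = rng.normal(scale=0.1, size=(n_feats, n_codewords))
b = np.zeros(n_codewords)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def ce_loss(P, y):
    # frame-based cross entropy over codeword posteriors
    return -np.log(P[np.arange(len(y)), y] + 1e-12).mean()

loss_before = ce_loss(softmax(X @ W + b), labels)

lr = 0.1
for _ in range(200):                        # plain gradient descent
    P = softmax(X @ W + b)
    G = P.copy()
    G[np.arange(n_frames), labels] -= 1.0   # dCE/dlogits = (P - onehot) / n
    G /= n_frames
    W -= lr * (X.T @ G)
    b -= lr * G.sum(axis=0)

loss_after = ce_loss(softmax(X @ W + b), labels)

# at decode time, each frame is quantized to its argmax codeword,
# which a DHMM can then consume as a discrete observation
codewords = np.argmax(X @ W + b, axis=1)
```

The cross-entropy objective is exactly the frame-based CE training the abstract refers to; the discrete codeword stream produced at the end is what replaces the continuous GMM likelihoods in a discrete HMM system.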



Author information

Correspondence to Tobias Watzel.


Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Watzel, T., Li, L., Kürzinger, L., Rigoll, G. (2019). Deep Neural Network Quantizers Outperforming Continuous Speech Recognition Systems. In: Salah, A., Karpov, A., Potapova, R. (eds) Speech and Computer. SPECOM 2019. Lecture Notes in Computer Science, vol. 11658. Springer, Cham. https://doi.org/10.1007/978-3-030-26061-3_54

  • DOI: https://doi.org/10.1007/978-3-030-26061-3_54

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-26060-6

  • Online ISBN: 978-3-030-26061-3

  • eBook Packages: Computer Science, Computer Science (R0)
