Recognition and Processing of Speech Signals Using Neural Networks

Abstract

This paper surveys recent deep-learning approaches to speech processing, primarily automatic speech recognition, but also text-to-speech synthesis and speaker, language, and emotion recognition. The focus is on efficient methods that address accuracy, computation, storage, and delay. The discussion places speech processing in the broader context of pattern recognition, comparing speech with other signals, and contrasts modern machine learning with earlier methods of speech analysis, e.g., hidden Markov models. The paper emphasizes a thorough understanding of the choices made in analyzing and interpreting speech signals; it minimizes the use of mathematics and is aimed at non-experts, with the references providing further detail for those interested.
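
As a concrete, heavily simplified illustration of the kind of pipeline the survey covers, the sketch below extracts mel-frequency cepstral coefficients (MFCCs) [10] from a signal and maps each feature frame to phone posteriors with a small feed-forward network. It is not taken from the paper: the library choices (librosa, PyTorch), the layer sizes, and the 40-phone inventory are assumptions made for the example, and practical recognizers use the much deeper recurrent or convolutional architectures discussed in the survey.

```python
# Illustrative sketch only (not from the paper): frame-level phone
# classification from MFCC features [10] with a tiny feed-forward network.
# Library choices, layer sizes, and the 40-phone set are assumptions.
import numpy as np
import librosa
import torch
import torch.nn as nn

sr = 16000                                        # 16-kHz speech, typical for ASR
signal = np.random.randn(sr).astype(np.float32)   # stand-in for 1 s of audio

# 13 MFCCs per 25-ms frame with a 10-ms hop: the classic front end of [10]
mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13,
                            n_fft=400, hop_length=160)    # shape (13, n_frames)
frames = torch.from_numpy(mfcc.T.astype(np.float32))      # (n_frames, 13)

# A deliberately small acoustic model: each frame is mapped to log-posteriors
# over a hypothetical 40-phone inventory, standing in for the far deeper
# stacks surveyed in the paper.
model = nn.Sequential(
    nn.Linear(13, 64),
    nn.ReLU(),
    nn.Linear(64, 40),
    nn.LogSoftmax(dim=-1),
)
log_posteriors = model(frames)                    # (n_frames, 40)
print(log_posteriors.shape)
```

In a real system, such a frame classifier would be trained on labeled speech and coupled to a sequence-level criterion such as connectionist temporal classification [16], or to the HMM-based decoders the paper compares against.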

References

  1. K. Audhkhasi, B. Ramabhadran, G. Saon, M. Picheny, D. Nahamoo, Direct acoustics-to-word models for English conversational speech recognition, in Interspeech (2017), pp. 959–963

  2. A. Avila, J. Monteiro, D. O’Shaughnessy, T. Falk, Speech emotion recognition on mobile devices based on modulation spectral feature pooling and deep neural networks, in IEEE ISSPIT (2017)

  3. T. Bäckström, Speech Coding: With Code-Excited Linear Prediction (Springer, Berlin, 2017)

  4. L. Bai, P. Weber, P. Jancovic, M. Russell, Exploring how phone classification neural networks learn phonetic information by visualising and interpreting bottleneck features, in Interspeech (2018), pp. 1472–1476

  5. Y. Bengio, A.C. Courville, P. Vincent, Representation learning: a review and new perspectives. IEEE Trans. Pattern Anal. Mach. Intell. 35(8), 1798–1828 (2013)

  6. C. Bishop, Pattern Recognition and Machine Learning (Springer, Berlin, 2006)

  7. C.-C. Chiu, et al., State-of-the-art speech recognition with sequence-to-sequence models, in ICASSP (2017)

  8. J.K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, Y. Bengio, Attention-based models for speech recognition, in NIPS (2015), pp. 1–16

  9. R. Collobert, C. Puhrsch, G. Synnaeve, Wav2Letter: an end-to-end ConvNet-based speech recognition system. arXiv:1609.03193 (2016)

  10. S.B. Davis, P. Mermelstein, Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Trans. Acoust. Speech Signal Process. 28(4), 357–366 (1980)

  11. H. Erdogan, J.R. Hershey, S. Watanabe, J. Le Roux, Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks, in ICASSP (2015), pp. 708–712

  12. E. Fosler-Lussier, Y. He, P. Jyothi, R. Prabhavalkar, Conditional random fields in speech, audio, and language processing. Proc. IEEE 101, 1054–1075 (2013)

  13. P. Ghahremani, H. Hadian, H. Lv, D. Povey, S. Khudanpur, Acoustic modeling from frequency domain representations of speech, in Interspeech (2018), pp. 1596–1600

  14. I. Goodfellow, Y. Bengio, A. Courville, Deep Learning (MIT Press, Cambridge, 2016)

  15. A. Graves, A. Mohamed, G. Hinton, Speech recognition with deep recurrent neural networks, in ICASSP (2013), pp. 6645–6649

  16. A. Graves, S. Fernández, F. Gomez, J. Schmidhuber, Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks, in International Conference on Machine Learning, Pittsburgh, PA (2006)

  17. G. Hinton, L. Deng, D. Yu, G.E. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T.N. Sainath, B. Kingsbury, Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups. IEEE Signal Process. Mag. 29, 82–97 (2012)

  18. X.D. Huang, A. Acero, H.-W. Hon, Spoken Language Processing: A Guide to Theory, Algorithm, and System Development (Prentice Hall, Englewood Cliffs, 2001)

  19. Y. Huang, A. Sethy, B. Ramabhadran, Fast neural network language model lookups at N-gram speed, in Interspeech (2017), pp. 274–278

  20. I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, Y. Bengio, Quantized neural networks: training neural networks with low precision weights and activations. J. Mach. Learn. Res. 18(187), 1–30 (2017)

  21. I.T. Jolliffe, Principal Component Analysis (Springer, Berlin, 2002)

  22. M. Jordan, E. Sudderth, M. Wainwright, A. Willsky, Major advances and emerging developments of graphical models. IEEE Signal Process. Mag. 27(6), 17–138 (2010)

  23. S. Kullback, R.A. Leibler, On information and sufficiency. Ann. Math. Stat. 22(1), 79–86 (1951)

  24. M. Kutner, J. Neter, C. Nachtsheim, W. Wasserman, Applied Linear Statistical Models (McGraw-Hill, New York, 2004)

  25. Y. LeCun, Y. Bengio, G. Hinton, Deep learning. Nature 521, 436–444 (2015)

  26. B. Li, et al., Acoustic modeling for Google Home, in Interspeech (2017), pp. 399–403

  27. W. Li, G. Cheng, F. Ge, P. Zhang, Y. Yan, Investigation on the combination of batch normalization and dropout in BLSTM-based acoustic modeling for ASR, in Interspeech (2018), pp. 2888–2892

  28. L. Lu, L. Kong, C. Dyer, N.A. Smith, S. Renals, Segmental recurrent neural networks for end-to-end speech recognition, in Interspeech (2016), pp. 385–389

  29. S. Mehri, K. Kumar, I. Gulrajani, R. Kumar, S. Jain, J. Sotelo, A. Courville, Y. Bengio, SampleRNN: an unconditional end-to-end neural audio generation model. arXiv:1612.07837 (2016)

  30. S.K. Moore, IBM’s new do-it-all AI chip. IEEE Spectrum, August, pp. 10–11 (2018)

  31. K. Mustafa, I.C. Bruce, Robust formant tracking for continuous speech with speaker variability. IEEE Trans. Audio Speech Lang. Process. 14(2) (2006)

  32. T. Nagamine, M.L. Seltzer, N. Mesgarani, Exploring how deep neural networks form phonemic categories, in Interspeech (2015), pp. 1912–1916

  33. T. Nagamine, M.L. Seltzer, N. Mesgarani, On the role of nonlinear transformations in deep neural network acoustic models, in Interspeech (2016), pp. 803–807

  34. T. Nagamine, N. Mesgarani, Understanding the representation and computation of multilayer perceptrons: a case study in speech recognition, in International Conference on Machine Learning, Sydney, Australia, PMLR 70 (2017)

  35. M. Nussbaum-Thom, J. Cui, B. Ramabhadran, V. Goel, Acoustic modeling using bidirectional gated recurrent convolutional units, in Interspeech (2016), pp. 390–394

  36. D. O’Shaughnessy, Speech Communications: Human and Machine (IEEE Press, New York, 2000)

  37. D. O’Shaughnessy, Automatic speech recognition: history, methods and challenges. Pattern Recogn. 41, 2965–2979 (2008)

  38. D. O’Shaughnessy, Interacting with computers by voice: automatic speech recognition and synthesis. Proc. IEEE 91, 1272–1305 (2003)

  39. W. Ping, K. Peng, A. Gibiansky, S.O. Arık, A. Kannan, S. Narang, Deep Voice 3: scaling text-to-speech with convolutional sequence learning, in ICLR (2018)

  40. R. Prabhavalkar, T.N. Sainath, B. Li, K. Rao, N. Jaitly, An analysis of “attention” in sequence-to-sequence models, in Interspeech (2017), pp. 3702–3706

  41. Y. Qian, M. Bi, T. Tan, K. Yu, Very deep convolutional neural networks for noise robust speech recognition. IEEE/ACM Trans. Audio Speech Lang. Process. 24, 2263–2276 (2016)

  42. L. Rabiner, B.-H. Juang, Fundamentals of Speech Recognition (Prentice-Hall, Upper Saddle River, 1993)

  43. M. Ratajczak, S. Tschiatschek, F. Pernkopf, Frame and segment level recurrent neural networks for phone classification, in Interspeech (2017), pp. 1318–1322

  44. T.N. Sainath, B. Li, Modeling time-frequency patterns with LSTM vs. convolutional architectures for LVCSR tasks, in Interspeech (2016), pp. 813–817

  45. G. Saon, et al., English conversational telephone speech recognition by humans and machines, in Interspeech (2017), pp. 132–136

  46. J. Sotelo, S. Mehri, K. Kumar, J.F. Santos, K. Kastner, A. Courville, Y. Bengio, Char2Wav: end-to-end speech synthesis, in ICLR (2017)

  47. S. Sun, C.-F. Yeh, M. Ostendorf, M.-Y. Hwang, L. Xie, Training augmentation with adversarial examples for robust speech recognition, in Interspeech (2018), pp. 2404–2408

  48. W. Sun, F. Su, L. Wang, Improving deep neural networks with multi-layer maxout networks and a novel initialization method. Neurocomputing 278, 34–40 (2018)

  49. I. Sutskever, O. Vinyals, Q.V. Le, Sequence to sequence learning with neural networks, in Advances in Neural Information Processing Systems (NIPS) (2014), pp. 3104–3112

  50. L. ten Bosch, L. Boves, Information encoding by deep neural networks: what can we learn? in Interspeech (2018), pp. 1457–1461

  51. A. Tjandra, S. Sakti, S. Nakamura, Sequence-to-sequence ASR optimization via reinforcement learning, in ICASSP (2018)

  52. A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, K. Kavukcuoglu, WaveNet: a generative model for raw audio. arXiv:1609.03499 (2016)

  53. Y. Wang, et al., Tacotron: towards end-to-end speech synthesis, in Interspeech (2017), pp. 4006–4010

  54. W. Xiong, L. Wu, F. Alleva, J. Droppo, X. Huang, A. Stolcke, The Microsoft 2017 conversational speech recognition system, in ICASSP (2018)

  55. Y. Zhang, M. Pezeshki, P. Brakel, S. Zhang, C. Laurent, Y. Bengio, A. Courville, Towards end-to-end speech recognition with deep convolutional neural networks, in Interspeech (2016), pp. 410–414

  56. Z. Zhang, J. Geiger, J. Pohjalainen, A. Mousa, W. Jin, B. Schuller, Deep learning for environmentally robust speech recognition: an overview of recent developments. ACM Trans. Intell. Syst. Technol. 9(5), 49 (2018)

Acknowledgements

This work was funded by NSERC (Canada) (Grant No. 142610). I thank Michael Picheny and Tiago Falk for their comments, which helped improve the paper.

Author information

Correspondence to Douglas O’Shaughnessy.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

O’Shaughnessy, D. Recognition and Processing of Speech Signals Using Neural Networks. Circuits Syst Signal Process 38, 3454–3481 (2019). https://doi.org/10.1007/s00034-019-01081-6
