Abstract
This paper provides an overview of recent approaches to deep learning as applied to speech processing tasks, primarily automatic speech recognition, but also text-to-speech and speaker, language, and emotion recognition. The focus is on efficient methods, addressing issues of accuracy, computation, storage, and delay. The discussion places the speech processing tasks in the broader context of pattern recognition, comparing speech with other signals. It also compares machine learning with earlier methods of speech analysis, e.g., hidden Markov models. The paper emphasizes a thorough understanding of the choices made in analyzing and interpreting speech signals. It minimizes the use of mathematics and is aimed at non-experts; the references provide the needed detail for those interested.
References
K. Audhkhasi, B. Ramabhadran, G. Saon, M. Picheny, D. Nahamoo, Direct acoustics-to-word models for English conversational speech recognition, in Interspeech (2017), pp. 959–963
A. Avila, J. Monteiro, D. O’Shaughnessy, T. Falk, Speech emotion recognition on mobile devices based on modulation spectral feature pooling and deep neural networks, in IEEE ISSPIT (2017)
T. Backstrom, Speech Coding: With Code-Excited Linear Prediction (Springer, Berlin, 2017)
L. Bai, P. Weber, P. Jancovic, M. Russell, Exploring how phone classification neural networks learn phonetic information by visualising and interpreting bottleneck features, in Interspeech (2018), pp. 1472–1476
Y. Bengio, A.C. Courville, P. Vincent, Representation learning: a review and new perspectives. IEEE Trans. Pattern Anal. Mach. Intell. 35(8), 1798–1828 (2013)
C. Bishop, Pattern Recognition and Machine Learning (Springer, Berlin, 2006)
C.-C. Chiu, et al., State-of-the-art speech recognition with sequence-to-sequence models, in ICASSP (2017)
J.K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, Y. Bengio, Attention-based models for speech recognition, in NIPS (2015), pp. 1–16
R. Collobert, C. Puhrsch, G. Synnaeve, Wav2Letter: an end-to-end ConvNet-based speech recognition system. arXiv:1609.03193 (2016)
S.B. Davis, P. Mermelstein, Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Trans. ASSP 28, 357–366 (1980)
H. Erdogan, J.R. Hershey, S. Watanabe, J. Le Roux, Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks, in ICASSP (2015), pp. 708–712
E. Fosler-Lussier, Y. He, P. Jyothi, R. Prabhavalkar, Conditional random fields in speech, audio, and language processing. Proc. IEEE 101, 1054–1075 (2013)
P. Ghahremani, H. Hadian, H. Lv, D. Povey, S. Khudanpur, Acoustic modeling from frequency domain representations of speech, in Interspeech (2018), pp. 1596–1600
I. Goodfellow, Y. Bengio, A. Courville, Deep Learning (MIT Press, Cambridge, 2016)
A. Graves, A. Mohamed, G. Hinton, Speech recognition with deep recurrent neural networks, in ICASSP (2013), pp. 6645–6649
A. Graves, S. Fernández, F. Gomez, J. Schmidhuber, Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks, in International Conference on Machine Learning, Pittsburgh, PA (2006)
G. Hinton, L. Deng, D. Yu, G.E. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T.N. Sainath, B. Kingsbury, Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups. IEEE Signal Process. Mag. 29, 82–97 (2012)
X.D. Huang, A. Acero, H.-W. Hon, Spoken Language Processing: A Guide to Theory, Algorithm, and System Development (Prentice Hall, Englewood Cliffs, 2001)
Y. Huang, A. Sethy, B. Ramabhadran, Fast neural network language model lookups at N-gram speed, in Interspeech (2017), pp. 274–278
I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, Y. Bengio, Quantized neural networks: training neural networks with low precision weights and activations. J. Mach. Learn. Res. 18(187), 1–30 (2017)
I.T. Jolliffe, Principal Component Analysis (Springer, Berlin, 2002)
M. Jordan, E. Sudderth, M. Wainwright, A. Willsky, Major advances and emerging developments of graphical models. IEEE Signal Process. Mag. 27(6), 17–138 (2010)
S. Kullback, R.A. Leibler, On information and sufficiency. Ann. Math. Stat. 22(1), 79–86 (1951)
M. Kutner, J. Neter, C. Nachtsheim, W. Wasserman, Applied Linear Statistical Models (McGraw-Hill, New York, 2004)
Y. LeCun, Y. Bengio, G. Hinton, Deep learning. Nature 521, 436–444 (2015)
B. Li, et al., Acoustic modeling for Google home, in Interspeech (2017), pp. 399–403
W. Li, G. Cheng, F. Ge, P. Zhang, Y. Yan, Investigation on the combination of batch normalization and dropout in BLSTM-based acoustic modeling for ASR, in Interspeech (2018), pp. 2888–2892
L. Lu, L. Kong, C. Dyer, N.A. Smith, S. Renals, Segmental recurrent neural networks for end-to-end speech recognition, in Interspeech (2016), pp. 385–389
S. Mehri, K. Kumar, I. Gulrajani, R. Kumar, S. Jain, J. Sotelo, A. Courville, Y. Bengio, SampleRNN: an unconditional end-to-end neural audio generation model. arXiv:1612.07837 (2016)
S.K. Moore, IBM’s new do-it-all AI chip. IEEE Spectrum, August, pp. 10–11 (2018)
K. Mustafa, I.C. Bruce, Robust formant tracking for continuous speech with speaker variability. IEEE Trans. Audio Speech Lang. Process. 14(2) (2006)
T. Nagamine, M.L. Seltzer, N. Mesgarani, Exploring how deep neural networks form phonemic categories, in Interspeech (2015), pp. 1912–1916
T. Nagamine, M.L. Seltzer, N. Mesgarani, On the role of nonlinear transformations in deep neural network acoustic models, in Interspeech (2016), pp. 803–807
T. Nagamine, N. Mesgarani, Understanding the representation and computation of multilayer perceptrons: a case study in speech recognition, in International Conference on Machine Learning, Sydney, Australia, PMLR 70 (2017)
M. Nussbaum-Thom, J. Cui, B. Ramabhadran, V. Goel, Acoustic modeling using bidirectional gated recurrent convolutional units, in Interspeech (2016), pp. 390–394
D. O’Shaughnessy, Speech Communications: Human and Machine (IEEE Press, New York, 2000)
D. O’Shaughnessy, Automatic speech recognition: history, methods and challenges. Pattern Recogn. 41, 2965–2979 (2008)
D. O’Shaughnessy, Interacting with computers by voice: automatic speech recognition and synthesis. Proc. IEEE 91, 1272–1305 (2003)
W. Ping, K. Peng, A. Gibiansky, S.O. Arık, A. Kannan, S. Narang, Deep voice 3: scaling text-to-speech with convolutional sequence learning, in ICLR (2018)
R. Prabhavalkar, T.N. Sainath, B. Li, K. Rao, N. Jaitly, An analysis of “attention” in sequence-to-sequence models, in Interspeech (2017), pp. 3702–3706
Y. Qian, M. Bi, T. Tan, K. Yu, Very deep convolutional neural networks for noise robust speech recognition. IEEE/ACM Trans. Audio Speech Lang. Process. 24, 2263–2276 (2016)
L. Rabiner, B.-H. Juang, Fundamentals of Speech Recognition (Prentice-Hall, Upper Saddle River, 1993)
M. Ratajczak, S. Tschiatschek, F. Pernkopf, Frame and segment level recurrent neural networks for phone classification, in Interspeech (2017), pp. 1318–1322
T.N. Sainath, B. Li, Modeling time-frequency patterns with LSTM vs. convolutional architectures for LVCSR tasks, in Interspeech (2016), pp. 813–817
G. Saon, et al., English conversational telephone speech recognition by humans and machines, in Interspeech (2017), pp. 132–136
J. Sotelo, S. Mehri, K. Kumar, J.F. Santos, K. Kastner, A. Courville, Y. Bengio, Char2Wav: end-to-end speech synthesis, in ICLR (2017)
S. Sun, C.-F. Yeh, M. Ostendorf, M.-Y. Hwang, L. Xie, Training augmentation with adversarial examples for robust speech recognition, in Interspeech (2018), pp. 2404–2408
W. Sun, F. Su, L. Wang, Improving deep neural networks with multi-layer maxout networks and a novel initialization method. Neurocomputing 278, 34–40 (2018)
I. Sutskever, O. Vinyals, Q.V. Le, Sequence to sequence learning with neural networks, in Advances in Neural Information Processing Systems (NIPS) (2014), pp. 3104–3112
L. ten Bosch, L. Boves, Information encoding by deep neural networks: what can we learn? in Interspeech (2018), pp. 1457–1461
A. Tjandra, S. Sakti, S. Nakamura, Sequence-to-sequence ASR optimization via reinforcement learning, in ICASSP (2018)
A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, K. Kavukcuoglu, WaveNet: a generative model for raw audio. arXiv:1609.03499 (2016)
Y. Wang, et al., Tacotron: towards end-to-end speech synthesis, in Interspeech (2017), pp. 4006–4010
W. Xiong, L. Wu, F. Alleva, J. Droppo, X. Huang, A. Stolcke, The Microsoft 2017 conversational speech recognition system, in ICASSP (2018)
Y. Zhang, M. Pezeshki, P. Brakel, S. Zhang, C. Laurent, Y. Bengio, A. Courville, Towards end-to-end speech recognition with deep convolutional neural networks, in Interspeech (2016), pp. 410–414
Z. Zhang, J. Geiger, J. Pohjalainen, A. Mousa, W. Jin, B. Schuller, Deep learning for environmentally robust speech recognition: an overview of recent developments. ACM Trans. Intell. Syst. Technol. 9(5), 49 (2018)
Acknowledgements
This work was funded by NSERC (Canada) (Grant No. 142610). I wish to thank Michael Picheny and Tiago Falk for their comments to help improve the paper.
Cite this article
O’Shaughnessy, D. Recognition and Processing of Speech Signals Using Neural Networks. Circuits Syst Signal Process 38, 3454–3481 (2019). https://doi.org/10.1007/s00034-019-01081-6