Abstract
This paper provides an overview of recent approaches to deep learning as applied to speech processing tasks, primarily automatic speech recognition, but also text-to-speech and speaker, language, and emotion recognition. The focus is on efficient methods, addressing issues of accuracy, computation, storage, and delay. The discussion places the speech processing tasks in the broader context of pattern recognition, comparing speech with other signals. It also compares machine learning with earlier methods of speech analysis, e.g., hidden Markov models. The paper emphasizes a thorough understanding of the choices made in analyzing and interpreting speech signals. It minimizes the use of mathematics and is aimed at non-experts; the references provide the needed detail for those interested.
References
K. Audhkhasi, B. Ramabhadran, G. Saon, M. Picheny, D. Nahamoo, Direct acoustics-to-word models for English conversational speech recognition, in Interspeech (2017), pp. 959–963
A. Avila, J. Monteiro, D. O’Shaughnessy, T. Falk, Speech emotion recognition on mobile devices based on modulation spectral feature pooling and deep neural networks, in IEEE ISSPIT (2017)
T. Backstrom, Speech Coding: With Code-Excited Linear Prediction (Springer, Berlin, 2017)
L. Bai, P. Weber, P. Jancovic, M. Russell, Exploring how phone classification neural networks learn phonetic information by visualising and interpreting bottleneck features, in Interspeech (2018), pp. 1472–1476
Y. Bengio, A.C. Courville, P. Vincent, Representation learning: a review and new perspectives. IEEE Trans. Pattern Anal. Mach. Intell. 35(8), 1798–1828 (2013)
C. Bishop, Pattern Recognition and Machine Learning (Springer, Berlin, 2006)
C.-C. Chiu, et al., State-of-the-art speech recognition with sequence-to-sequence models, in ICASSP (2017)
J.K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, Y. Bengio, Attention-based models for speech recognition, in NIPS (2015), pp. 1–16
R. Collobert, C. Puhrsch, G. Synnaeve, Wav2Letter: an end-to-end ConvNet-based speech recognition system. arXiv:1609.03193 (2016)
S.B. Davis, P. Mermelstein, Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Trans. ASSP 28, 357–366 (1980)
H. Erdogan, J.R. Hershey, S. Watanabe, J. Le Roux, Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks, in ICASSP (2015), pp. 708–712
E. Fosler-Lussier, Y. He, P. Jyothi, R. Prabhavalkar, Conditional random fields in speech, audio, and language processing. Proc. IEEE 101, 1054–1075 (2013)
P. Ghahremani, H. Hadian, H. Lv, D. Povey, S. Khudanpur, Acoustic modeling from frequency domain representations of speech, in Interspeech (2018), pp. 1596–1600
I. Goodfellow, Y. Bengio, A. Courville, Deep Learning (MIT Press, Cambridge, 2016)
A. Graves, A. Mohamed, G. Hinton, Speech recognition with deep recurrent neural networks, in ICASSP (2013), pp. 6645–6649
A. Graves, S. Fernández, F. Gomez, J. Schmidhuber, Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks, in International Conference on Machine Learning, Pittsburgh, PA (2006)
G. Hinton, L. Deng, D. Yu, G.E. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T.N. Sainath, B. Kingsbury, Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups. IEEE Signal Process. Mag. 29, 82–97 (2012)
X.D. Huang, A. Acero, H.-W. Hon, Spoken Language Processing: A Guide to Theory, Algorithm, and System Development (Prentice Hall, Englewood Cliffs, 2001)
Y. Huang, A. Sethy, B. Ramabhadran, Fast neural network language model lookups at N-gram speed, in Interspeech (2017), pp. 274–278
I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, Y. Bengio, Quantized neural networks: training neural networks with low precision weights and activations. J. Mach. Learn. Res. 18(187), 1–30 (2017)
I.T. Jolliffe, Principal Component Analysis (Springer, Berlin, 2002)
M. Jordan, E. Sudderth, M. Wainwright, A. Willsky, Major advances and emerging developments of graphical models. IEEE Signal Process. Mag. 27(6), 17–138 (2010)
S. Kullback, R.A. Leibler, On information and sufficiency. Ann. Math. Stat. 22(1), 79–86 (1951)
M. Kutner, J. Neter, C. Nachtsheim, W. Wasserman, Applied Linear Statistical Models (McGraw-Hill, New York, 2004)
Y. LeCun, Y. Bengio, G. Hinton, Deep learning. Nature 521, 436–444 (2015)
B. Li, et al., Acoustic modeling for Google home, in Interspeech (2017), pp. 399–403
W. Li, G. Cheng, F. Ge, P. Zhang, Y. Yan, Investigation on the combination of batch normalization and dropout in BLSTM-based acoustic modeling for ASR, in Interspeech (2018), pp. 2888–2892
L. Lu, L. Kong, C. Dyer, N.A. Smith, S. Renals, Segmental recurrent neural networks for end-to-end speech recognition, in Interspeech (2016), pp. 385–389
S. Mehri, K. Kumar, I. Gulrajani, R. Kumar, S. Jain, J. Sotelo, A. Courville, Y. Bengio, SampleRNN: an unconditional end-to-end neural audio generation model. arXiv:1612.07837 (2016)
S.K. Moore, IBM’s new do-it-all AI chip. IEEE Spectrum, August, pp. 10–11 (2018)
K. Mustafa, I.C. Bruce, Robust formant tracking for continuous speech with speaker variability. IEEE Trans. Audio Speech Lang. Process. 14(2) (2006)
T. Nagamine, M.L. Seltzer, N. Mesgarani, Exploring how deep neural networks form phonemic categories, in Interspeech (2015), pp. 1912–1916
T. Nagamine, M.L. Seltzer, N. Mesgarani, On the role of nonlinear transformations in deep neural network acoustic models, in Interspeech (2016), pp. 803–807
T. Nagamine, N. Mesgarani, Understanding the representation and computation of multilayer perceptrons: a case study in speech recognition, in International Conference on Machine Learning, Sydney, Australia, PMLR 70 (2017)
M. Nussbaum-Thom, J. Cui, B. Ramabhadran, V. Goel, Acoustic modeling using bidirectional gated recurrent convolutional units, in Interspeech (2016), pp. 390–394
D. O’Shaughnessy, Speech Communications: Human and Machine (IEEE Press, New York, 2000)
D. O’Shaughnessy, Automatic speech recognition: history, methods and challenges. Pattern Recogn. 41, 2965–2979 (2008)
D. O’Shaughnessy, Interacting with computers by voice: automatic speech recognition and synthesis. Proc. IEEE 91, 1272–1305 (2003)
W. Ping, K. Peng, A. Gibiansky, S.O. Arık, A. Kannan, S. Narang, Deep voice 3: scaling text-to-speech with convolutional sequence learning, in ICLR (2018)
R. Prabhavalkar, T.N. Sainath, B. Li, K. Rao, N. Jaitly, An analysis of “attention” in sequence-to-sequence models, in Interspeech (2017), pp. 3702–3706
Y. Qian, M. Bi, T. Tan, K. Yu, Very deep convolutional neural networks for noise robust speech recognition. IEEE/ACM Trans. Audio Speech Lang. Process. 24, 2263–2276 (2016)
L. Rabiner, B.-H. Juang, Fundamentals of Speech Recognition (Prentice-Hall, Upper Saddle River, 1993)
M. Ratajczak, S. Tschiatschek, F. Pernkopf, Frame and segment level recurrent neural networks for phone classification, in Interspeech (2017), pp. 1318–1322
T.N. Sainath, B. Li, Modeling time-frequency patterns with LSTM vs. convolutional architectures for LVCSR tasks, in Interspeech (2016), pp. 813–817
G. Saon, et al., English conversational telephone speech recognition by humans and machines, in Interspeech (2017), pp. 132–136
J. Sotelo, S. Mehri, K. Kumar, J.F. Santos, K. Kastner, A. Courville, Y. Bengio, Char2Wav: end-to-end speech synthesis, in ICLR (2017)
S. Sun, C.-F. Yeh, M. Ostendorf, M.-Y. Hwang, L. Xie, Training augmentation with adversarial examples for robust speech recognition, in Interspeech (2018), pp. 2404–2408
W. Sun, F. Su, L. Wang, Improving deep neural networks with multi-layer maxout networks and a novel initialization method. Neurocomputing 278, 34–40 (2018)
I. Sutskever, O. Vinyals, Q.V. Le, Sequence to sequence learning with neural networks, in Advances in Neural Information Processing Systems (NIPS) (2014), pp. 3104–3112
L. ten Bosch, L. Boves, Information encoding by deep neural networks: what can we learn? in Interspeech (2018), pp. 1457–1461
A. Tjandra, S. Sakti, S. Nakamura, Sequence-to-sequence ASR optimization via reinforcement learning, in ICASSP (2018)
A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, K. Kavukcuoglu, WaveNet: a generative model for raw audio. arXiv:1609.03499 (2016)
Y. Wang, et al., Tacotron: towards end-to-end speech synthesis, in Interspeech (2017), pp. 4006–4010
W. Xiong, L. Wu, F. Alleva, J. Droppo, X. Huang, A. Stolcke, The Microsoft 2017 conversational speech recognition system, in ICASSP (2018)
Y. Zhang, M. Pezeshki, P. Brakel, S. Zhang, C. Laurent, Y. Bengio, A. Courville, Towards end-to-end speech recognition with deep convolutional neural networks, in Interspeech (2016), pp. 410–414
Z. Zhang, J. Geiger, J. Pohjalainen, A. Mousa, W. Jin, B. Schuller, Deep learning for environmentally robust speech recognition: an overview of recent developments. ACM Trans. Intell. Syst. Technol. 9(5), 49 (2018)
Acknowledgements
This work was funded by NSERC (Canada) (Grant No. 142610). I wish to thank Michael Picheny and Tiago Falk for their comments to help improve the paper.
Cite this article
O’Shaughnessy, D. Recognition and Processing of Speech Signals Using Neural Networks. Circuits Syst Signal Process 38, 3454–3481 (2019). https://doi.org/10.1007/s00034-019-01081-6