Abstract
Robust automatic speech recognition (ASR) technologies have evolved greatly with the emergence of deep learning. This chapter introduces the general background of robustness issues in deep-neural-network-based ASR. It provides an overview of robust ASR research, including a brief history of several studies from before the deep learning era, and the basic formulations of ASR, signal processing, and neural networks. The chapter also introduces common notations for variables and equations, which are extended in later chapters to deal with more advanced topics. Finally, the chapter provides an overview of the book's structure by summarizing the contributions of the individual chapters and associating them with the different components of a robust ASR system.
Notes
- 1. The WERs refer to the Kaldi AMI recipe, November 15, 2016. https://github.com/kaldi-asr/kaldi/blob/master/egs/ami/s5b.
- 2.
- 3. However, these concepts have inspired related techniques for DNN-based acoustic models, such as DNN parameter regularization based on the L2 norm and Kullback–Leibler (KL) divergence, which can be regarded as a variant of MAP adaptation in the context of DNNs.
- 4. This problem is discussed in Chap. 13.
Copyright information
© 2017 Springer International Publishing AG
About this chapter
Cite this chapter
Watanabe, S., Delcroix, M., Metze, F., Hershey, J.R. (2017). Preliminaries. In: Watanabe, S., Delcroix, M., Metze, F., Hershey, J. (eds) New Era for Robust Speech Recognition. Springer, Cham. https://doi.org/10.1007/978-3-319-64680-0_1
DOI: https://doi.org/10.1007/978-3-319-64680-0_1
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-64679-4
Online ISBN: 978-3-319-64680-0
eBook Packages: Computer Science (R0)