Front-End, Back-End, and Hybrid Techniques for Noise-Robust Speech Recognition

Deng, Li

doi:10.1007/978-3-642-21317-5_4

Abstract

Noise robustness has long been an active area of research that attracts significant interest from speech recognition researchers and developers. In this chapter, with a focus on the problem of uncertainty handling in robust speech recognition, we use the Bayesian framework as a common thread for connecting, analyzing, and categorizing a number of popular approaches pursued in the recent past. The topics covered in this chapter include 1) Bayesian decision rules with unreliable features and unreliable model parameters; 2) principled ways of computing feature uncertainty using structured speech distortion models; 3) use of a phase factor in an advanced speech distortion model for feature compensation; 4) a novel perspective on model compensation as a special implementation of the general Bayesian predictive classification rule capitalizing on model parameter uncertainty; 5) a taxonomy of noise compensation techniques using two distinct axes, feature vs. model domain and structured vs. unstructured transformation; and 6) noise-adaptive training as a hybrid feature-model compensation framework and its various extensions.

Author information

Authors and Affiliations

Microsoft Research, One Microsoft Way, Redmond, WA, 98052, USA
Li Deng

Corresponding author

Correspondence to Li Deng.

Editor information

Editors and Affiliations

Institute of Communication Acoustics, Ruhr-Universität Bochum, Universitätsstrasse 150, Bochum, 44801, Germany
Dorothea Kolossa
Dept. of Communications Engineering, University of Paderborn, Warburger Strasse 100, Paderborn, 33098, Germany
Reinhold Häb-Umbach

About this chapter

Cite this chapter

Deng, L. (2011). Front-End, Back-End, and Hybrid Techniques for Noise-Robust Speech Recognition. In: Kolossa, D., Häb-Umbach, R. (eds) Robust Speech Recognition of Uncertain or Missing Data. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-21317-5_4

DOI: https://doi.org/10.1007/978-3-642-21317-5_4
Published: 23 June 2011
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-21316-8
Online ISBN: 978-3-642-21317-5
eBook Packages: Engineering, Engineering (R0)
