Abstract
Noise robustness has long been an active area of research that attracts significant interest from speech recognition researchers and developers. In this chapter, focusing on the problem of uncertainty handling in robust speech recognition, we use the Bayesian framework as a common thread for connecting, analyzing, and categorizing a number of popular approaches pursued in the recent past. The topics covered in this chapter include 1) Bayesian decision rules with unreliable features and unreliable model parameters; 2) principled ways of computing feature uncertainty using structured speech distortion models; 3) the use of a phase factor in an advanced speech distortion model for feature compensation; 4) a novel perspective on model compensation as a special implementation of the general Bayesian predictive classification rule, capitalizing on model parameter uncertainty; 5) a taxonomy of noise compensation techniques along two distinct axes, feature vs. model domain and structured vs. unstructured transformation; and 6) noise-adaptive training as a hybrid feature-model compensation framework and its various extensions.
Copyright information
© 2011 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Deng, L. (2011). Front-End, Back-End, and Hybrid Techniques for Noise-Robust Speech Recognition. In: Kolossa, D., Häb-Umbach, R. (eds) Robust Speech Recognition of Uncertain or Missing Data. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-21317-5_4
DOI: https://doi.org/10.1007/978-3-642-21317-5_4
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-21316-8
Online ISBN: 978-3-642-21317-5
eBook Packages: Engineering, Engineering (R0)