Abstract
In this contribution, conditional Bayesian estimation employing a phase-sensitive observation model for noise robust speech recognition will be studied. After a review of speech recognition in the presence of corrupted features, termed uncertainty decoding, the estimation of the posterior distribution of the uncorrupted (clean) feature vector will be shown to be a key element of noise robust speech recognition. The estimation process will be based on three major components: an a priori model of the unobservable data, an observation model relating the unobservable data to the corrupted observation, and an inference algorithm that allows for a computationally tractable solution. Special emphasis will be placed on a detailed derivation of the phase-sensitive observation model and the required moments of the phase factor distribution. It will not only be proven analytically that the phase factor distribution is non-Gaussian but also that all central moments can (approximately) be computed solely from the mel filter bank in use, rendering the moments independent of noise type and signal-to-noise ratio. The phase-sensitive observation model will then be incorporated into a model-based feature enhancement scheme, and recognition experiments will be carried out on the Aurora 2 and Aurora 4 databases. All recognition results underline the importance of incorporating phase factor information into the enhancement scheme. Applying the proposed scheme within the derived uncertainty decoding framework further leads to significant improvements in both recognition tasks, eventually reaching the performance achieved with the ETSI advanced front-end.
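The abstract does not reproduce the observation model itself. As a rough illustration only, the sketch below uses the form in which a phase-sensitive model is commonly written in the log-mel domain, y = log(e^x + e^n + 2 α e^{(x+n)/2}), where x, n and y are the clean, noise and noisy log-mel energies of one filter bank channel and α is the phase factor. The function names, the five-bin filter weights and the random spectra are hypothetical and are not taken from the chapter; the example merely checks that the model is an exact identity when α is computed from the underlying spectra.

```python
import numpy as np

def phase_factor(X, N, w):
    """Phase factor alpha of one mel channel (hypothetical helper).

    X, N: complex short-time spectra of clean speech and noise for one frame.
    w:    non-negative mel filter weights over the same frequency bins.
    """
    # Re(X * conj(N)) equals |X||N| cos(theta), theta being the phase difference.
    cross = np.sum(w * np.real(X * np.conj(N)))
    x_pow = np.sum(w * np.abs(X) ** 2)
    n_pow = np.sum(w * np.abs(N) ** 2)
    return cross / np.sqrt(x_pow * n_pow)

def noisy_log_mel(x, n, alpha=0.0):
    """Noisy log-mel energy y from clean x, noise n and phase factor alpha.

    With alpha = 0 this reduces to the phase-insensitive model log(e^x + e^n).
    """
    return np.log(np.exp(x) + np.exp(n) + 2.0 * alpha * np.exp(0.5 * (x + n)))

# Toy data: assumed values, chosen only to exercise the functions.
rng = np.random.default_rng(0)
w = np.array([0.25, 0.5, 1.0, 0.5, 0.25])          # illustrative triangular filter
X = rng.normal(size=5) + 1j * rng.normal(size=5)   # "clean speech" spectrum
N = rng.normal(size=5) + 1j * rng.normal(size=5)   # "noise" spectrum

alpha = phase_factor(X, N, w)
x = np.log(np.sum(w * np.abs(X) ** 2))             # clean log-mel energy
n = np.log(np.sum(w * np.abs(N) ** 2))             # noise log-mel energy
y_exact = np.log(np.sum(w * np.abs(X + N) ** 2))   # log-mel energy of the mixture
y_model = noisy_log_mel(x, n, alpha)
```

In this toy setting `y_model` agrees with `y_exact` up to floating-point error, and the Cauchy-Schwarz inequality bounds `alpha` to the interval [-1, 1]. In an enhancement scheme the phase factor is unobservable and has to be treated as a random variable, which is why its moments are needed.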
Copyright information
© 2011 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Leutnant, V., Haeb-Umbach, R. (2011). Conditional Bayesian Estimation Employing a Phase-Sensitive Observation Model for Noise Robust Speech Recognition. In: Kolossa, D., Häb-Umbach, R. (eds) Robust Speech Recognition of Uncertain or Missing Data. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-21317-5_8
DOI: https://doi.org/10.1007/978-3-642-21317-5_8
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-21316-8
Online ISBN: 978-3-642-21317-5
eBook Packages: Engineering, Engineering (R0)