Conditional Bayesian Estimation Employing a Phase-Sensitive Observation Model for Noise Robust Speech Recognition

Leutnant, Volker; Haeb-Umbach, Reinhold

doi:10.1007/978-3-642-21317-5_8


Abstract

This contribution studies conditional Bayesian estimation employing a phase-sensitive observation model for noise robust speech recognition. After a review of speech recognition in the presence of corrupted features, termed uncertainty decoding, the estimation of the posterior distribution of the uncorrupted (clean) feature vector is shown to be a key element of noise robust speech recognition. The estimation process is based on three major components: an a priori model of the unobservable data, an observation model relating the unobservable data to the corrupted observation, and an inference algorithm that allows for a computationally tractable solution. Particular emphasis is placed on a detailed derivation of the phase-sensitive observation model and the required moments of the phase factor distribution. It is proven analytically that the phase factor distribution is non-Gaussian, and that all central moments can (approximately) be computed solely from the mel filter bank in use, which renders the moments independent of noise type and signal-to-noise ratio. The phase-sensitive observation model is then incorporated into a model-based feature enhancement scheme, and recognition experiments are carried out on the Aurora 2 and Aurora 4 databases. All recognition results point to the importance of incorporating phase factor information into the enhancement scheme. Applying the proposed scheme under the derived uncertainty decoding framework leads to further significant improvements in both recognition tasks, eventually reaching the performance achieved with the ETSI advanced front-end.
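
The claim that the phase factor moments depend only on the mel filter bank can be illustrated with a minimal sketch. It is not the chapter's derivation. It rests on simplifying assumptions: within one mel filter the speech and noise magnitudes are taken as constant across frequency bins, so the phase factor reduces to a weighted average of cosines, alpha = sum_j w_j cos(theta_j) / sum_j w_j, and the phase differences theta_j are taken as independent and uniform on (-pi, pi). Under these assumptions the variance is sum_j w_j^2 / (2 (sum_j w_j)^2), which involves the filter weights only. The triangular filter shape and all function names below are hypothetical and chosen for illustration.

```python
import math
import random


def triangular_weights(num_bins):
    """Hypothetical triangular filter shape spanning `num_bins` frequency bins."""
    half = (num_bins + 1) / 2.0
    return [1.0 - abs(j + 1 - half) / half for j in range(num_bins)]


def phase_factor_variance(weights):
    """Closed-form variance of alpha under the stated assumptions.

    Each cos(theta_j) has zero mean and variance 1/2 for uniform theta_j,
    so the variance of the weighted average depends on the weights alone.
    """
    total = sum(weights)
    return sum(w * w for w in weights) / (2.0 * total * total)


def sample_phase_factor(weights, rng):
    """Draw one realisation of alpha with i.i.d. uniform phase differences."""
    total = sum(weights)
    return sum(
        w * math.cos(rng.uniform(-math.pi, math.pi)) for w in weights
    ) / total


def empirical_variance(weights, num_samples=20000, seed=0):
    """Monte Carlo estimate of the variance of alpha, for comparison."""
    rng = random.Random(seed)
    samples = [sample_phase_factor(weights, rng) for _ in range(num_samples)]
    mean = sum(samples) / num_samples
    return sum((s - mean) ** 2 for s in samples) / num_samples


if __name__ == "__main__":
    w = triangular_weights(5)
    print("closed form:", phase_factor_variance(w))
    print("Monte Carlo:", empirical_variance(w))
```

In this simplified setting alpha is confined to the interval [-1, 1], so it cannot be exactly Gaussian, and neither the noise type nor the signal-to-noise ratio enters the variance.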

Author information

Authors and Affiliations

Department of Communications Engineering, University of Paderborn, Warburger Straße 100, 33098, Paderborn, Germany
Volker Leutnant & Reinhold Haeb-Umbach

Corresponding author

Correspondence to Volker Leutnant.

Editor information

Editors and Affiliations

Institute of Communication Acoustics, Ruhr-Universität Bochum, Universitätsstrasse 150, Bochum, 44801, Germany
Dorothea Kolossa
Dept. of Communications Engineering, University of Paderborn, Warburger Strasse 100, Paderborn, 33098, Germany
Reinhold Häb-Umbach

About this chapter

Cite this chapter

Leutnant, V., Haeb-Umbach, R. (2011). Conditional Bayesian Estimation Employing a Phase-Sensitive Observation Model for Noise Robust Speech Recognition. In: Kolossa, D., Häb-Umbach, R. (eds) Robust Speech Recognition of Uncertain or Missing Data. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-21317-5_8

DOI: https://doi.org/10.1007/978-3-642-21317-5_8
Published: 23 June 2011
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-21316-8
Online ISBN: 978-3-642-21317-5
eBook Packages: Engineering, Engineering (R0)
