Abstract
Over the last decade, microphone array processing has emerged as a powerful tool for increasing the noise robustness of automatic speech recognition (ASR) systems. Typically, microphone arrays are used as preprocessors that enhance the incoming speech signal prior to recognition. While such traditional approaches can lead to good results, they usually require large numbers of microphones to reach acceptable performance in practice. Furthermore, important information, such as uncertainty estimates and energy bounds, is often ignored because speech recognition is conventionally performed only on the enhanced output of the array. Using the probabilistic concept of evidence modeling, this chapter presents a novel approach to robust ASR that aims for closer integration of microphone array processing and missing data speech recognition in reverberant multi-speaker environments. The output of the array is used to estimate the probability density function (pdf) of the hidden clean speech data using any information that may be available before and after array processing. The chapter discusses different types of evidence pdfs and shows how these models can be used effectively during HMM decoding.
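To make the idea of evidence pdfs in decoding concrete, the following is a minimal sketch, not the chapter's own model. It assumes a single diagonal-covariance Gaussian per HMM state and shows three textbook evidence types: a point (delta) evidence, a Gaussian evidence as in uncertainty decoding, and a uniform evidence on an energy interval as in bounded marginalization. All function names and the toy values are hypothetical.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian N(mean, var) at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def gaussian_cdf(x, mean, var):
    """Cumulative distribution of a 1-D Gaussian N(mean, var) at x."""
    return 0.5 * (1.0 + math.erf((x - mean) / math.sqrt(2.0 * var)))


def likelihood_point(obs, mean, var):
    """Delta evidence: the enhanced feature is taken as the clean value."""
    return gaussian_pdf(obs, mean, var)


def likelihood_gaussian_evidence(obs, obs_var, mean, var):
    """Gaussian evidence N(obs, obs_var): integrating it against the state
    Gaussian adds the observation uncertainty to the state variance."""
    return gaussian_pdf(obs, mean, var + obs_var)


def likelihood_bounded_evidence(lower, upper, mean, var):
    """Uniform evidence on [lower, upper]: the state Gaussian is averaged
    over the interval in which the clean value is assumed to lie."""
    if upper <= lower:
        raise ValueError("upper bound must exceed lower bound")
    mass = gaussian_cdf(upper, mean, var) - gaussian_cdf(lower, mean, var)
    return mass / (upper - lower)


def frame_log_likelihood(obs, reliable, means, variances, lower=0.0):
    """State log-likelihood of one frame under a diagonal Gaussian.
    Reliable components use point evidence; unreliable ones are bounded
    between `lower` and the observed value (assumed to exceed `lower`)."""
    total = 0.0
    for y, r, m, v in zip(obs, reliable, means, variances):
        if r:
            lik = likelihood_point(y, m, v)
        else:
            lik = likelihood_bounded_evidence(lower, y, m, v)
        total += math.log(max(lik, 1e-300))  # floor avoids log(0)
    return total


if __name__ == "__main__":
    # Toy two-channel frame: first channel reliable, second unreliable.
    print(frame_log_likelihood([1.0, 2.0], [True, False], [0.0, 0.0], [1.0, 1.0]))
```

In a full recognizer, a term of this kind would replace the usual state emission likelihood inside Viterbi decoding, and with Gaussian mixtures it would be evaluated per mixture component. How the reliability flags, bounds, and variances are obtained from the array output is the subject of the chapter and is not modeled here.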
Acknowledgements
This research was partly funded by the Australian Research Council (ARC) grant no. DP1096348.
Copyright information
© 2011 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Kühne, M., Togneri, R., Nordholm, S. (2011). Evidence Modeling for Missing Data Speech Recognition Using Small Microphone Arrays. In: Kolossa, D., Häb-Umbach, R. (eds) Robust Speech Recognition of Uncertain or Missing Data. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-21317-5_11
DOI: https://doi.org/10.1007/978-3-642-21317-5_11
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-21316-8
Online ISBN: 978-3-642-21317-5
eBook Packages: Engineering, Engineering (R0)