
Speech Recognition with Microphone Arrays

Chapter in the book Microphone Arrays

Part of the book series: Digital Signal Processing (DIGSIGNAL)

Abstract

Microphone arrays can be advantageously employed in Automatic Speech Recognition (ASR) systems to allow distant-talking interaction. Their beamforming capabilities are used to enhance the speech message while attenuating the undesired contributions of environmental noise and reverberation. The first part of this chapter briefly reviews the state of the art of ASR systems, with particular attention to robustness in distant-talking applications. The objective is to reduce the mismatch between the real noisy data and the acoustic models used by the recognizer; beamforming, speech enhancement, feature compensation, and model adaptation are the techniques adopted to this end. The second part of the chapter is dedicated to the description of a microphone-array-based speech recognition system developed at ITC-IRST. It includes a linear-array beamformer, an acoustic front-end for speech activity detection and feature extraction, a recognition engine based on Hidden Markov Models, and modules for training and adaptation of the acoustic models. Finally, the performance of this system on a typical recognition task is reported.
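As a rough illustration of the beamforming step mentioned in the abstract, the following Python sketch shows a basic delay-and-sum beamformer: the microphone signals are time-aligned toward an assumed talker position and averaged, which reinforces the direct speech while attenuating uncorrelated noise. This is a minimal sketch of the general technique, not the ITC-IRST system described in the chapter; the function name, the array geometry, and the assumption of a known source position are illustrative only.

```python
import numpy as np

def delay_and_sum(signals, mic_positions, source_position, fs, c=343.0):
    """Minimal delay-and-sum beamformer (illustrative sketch, not the chapter's system).

    signals         : (num_mics, num_samples) array of synchronized recordings
    mic_positions   : (num_mics, 3) microphone coordinates in metres
    source_position : (3,) assumed talker position in metres
    fs              : sampling rate in Hz
    c               : speed of sound in m/s
    """
    num_mics, num_samples = signals.shape
    # Propagation distance (and hence delay) from the talker to each microphone.
    distances = np.linalg.norm(mic_positions - source_position, axis=1)
    delays = distances / c
    # Express each channel's extra delay relative to the closest microphone,
    # rounded to whole samples for simplicity (no fractional-delay filtering).
    lags = np.round((delays - delays.min()) * fs).astype(int)
    output = np.zeros(num_samples)
    for channel, lag in zip(signals, lags):
        aligned = np.roll(channel, -lag)   # advance the later-arriving channel
        if lag > 0:
            aligned[-lag:] = 0.0           # discard samples that wrapped around
        output += aligned
    return output / num_mics
```

The beamformed signal could then be passed to a conventional acoustic front-end (speech activity detection and feature extraction) before HMM decoding, mirroring the pipeline outlined in the abstract.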

Copyright information

© 2001 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Omologo, M., Matassoni, M., Svaizer, P. (2001). Speech Recognition with Microphone Arrays. In: Brandstein, M., Ward, D. (eds) Microphone Arrays. Digital Signal Processing. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-04619-7_15

  • DOI: https://doi.org/10.1007/978-3-662-04619-7_15

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-07547-6

  • Online ISBN: 978-3-662-04619-7

  • eBook Packages: Springer Book Archive
