Abstract
Automatic speech recognition (ASR) is a critical component of CHIL services: it provides the input to higher-level technologies, such as summarization and question answering, as discussed in Chapter 8. In the spirit of ubiquitous computing, the goal of ASR in CHIL is to achieve high performance using far-field sensors (networks of microphone arrays and distributed far-field microphones). Close-talking microphones remain of interest, however, as they provide a best-case acoustic channel against which far-field ASR system development can be benchmarked.
© 2009 Springer-Verlag London Limited
Cite this chapter
Potamianos, G. et al. (2009). Automatic Speech Recognition. In: Waibel, A., Stiefelhagen, R. (eds) Computers in the Human Interaction Loop. Human–Computer Interaction Series. Springer, London. https://doi.org/10.1007/978-1-84882-054-8_6
DOI: https://doi.org/10.1007/978-1-84882-054-8_6
Publisher Name: Springer, London
Print ISBN: 978-1-84882-053-1
Online ISBN: 978-1-84882-054-8