Abstract
Automatic speech recognition (ASR) is a critical component of CHIL services: it provides the input to higher-level technologies, such as summarization and question answering, as discussed in Chapter 8. In the spirit of ubiquitous computing, the goal of ASR in CHIL is to achieve high performance using far-field sensors (networks of microphone arrays and distributed far-field microphones). Close-talking microphones remain of interest, however, as they provide a best-case acoustic channel against which far-field ASR system development can be benchmarked.
© 2009 Springer-Verlag London Limited
Cite this chapter
Potamianos, G. et al. (2009). Automatic Speech Recognition. In: Waibel, A., Stiefelhagen, R. (eds) Computers in the Human Interaction Loop. Human–Computer Interaction Series. Springer, London. https://doi.org/10.1007/978-1-84882-054-8_6
DOI: https://doi.org/10.1007/978-1-84882-054-8_6
Publisher Name: Springer, London
Print ISBN: 978-1-84882-053-1
Online ISBN: 978-1-84882-054-8