Abstract
This paper describes the 2006 lecture and conference meeting speech-to-text system developed at the Interactive Systems Laboratories (ISL). The system covers the individual head-mounted microphone (IHM), single distant microphone (SDM), and multiple distant microphone (MDM) conditions, and was evaluated in the RT-06S Rich Transcription Meeting Evaluation sponsored by the US National Institute of Standards and Technology (NIST). We describe the principal differences between our current system and those submitted in previous years: improved acoustic and language models, cross-adaptation between systems with different front-ends and phoneme sets, and the use of various automatic speech segmentation algorithms.
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Fügen, C. et al. (2006). The ISL RT-06S Speech-to-Text System. In: Renals, S., Bengio, S., Fiscus, J.G. (eds) Machine Learning for Multimodal Interaction. MLMI 2006. Lecture Notes in Computer Science, vol 4299. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11965152_36
DOI: https://doi.org/10.1007/11965152_36
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-69267-6
Online ISBN: 978-3-540-69268-3
eBook Packages: Computer Science (R0)