The ISL RT-06S Speech-to-Text System

Fügen, Christian; Ikbal, Shajith; Kraft, Florian; Kumatani, Kenichi; Laskowski, Kornel; McDonough, John W.; Ostendorf, Mari; Stüker, Sebastian; Wölfel, Matthias

doi:10.1007/11965152_36

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 4299)

Included in the following conference series:

International Workshop on Machine Learning for Multimodal Interaction

Abstract

This paper describes the 2006 lecture and conference meeting speech-to-text system developed at the Interactive Systems Laboratories (ISL) for the individual head-mounted microphone (IHM), single distant microphone (SDM), and multiple distant microphone (MDM) conditions. The system was evaluated in the RT-06S Rich Transcription Meeting Evaluation sponsored by the US National Institute of Standards and Technology (NIST). We describe the principal differences between our current system and those submitted in previous years: improved acoustic and language models, cross-adaptation between systems with different front-ends and phoneme sets, and the use of various automatic speech segmentation algorithms.

Author information

Authors and Affiliations

Interactive Systems Laboratories, Universität Karlsruhe (TH), Karlsruhe, Germany
Christian Fügen, Shajith Ikbal, Florian Kraft, Kenichi Kumatani, Kornel Laskowski, John W. McDonough, Mari Ostendorf, Sebastian Stüker & Matthias Wölfel
Dept. of Electrical Engineering, University of Washington, Seattle, WA, USA
Mari Ostendorf

Editor information

Editors and Affiliations

University of Edinburgh, Edinburgh, Scotland
Steve Renals
IDIAP Research Institute, Martigny, Switzerland
Samy Bengio
National Institute of Standards and Technology, 100 Bureau Drive Stop 8940, Gaithersburg, MD 20899
Jonathan G. Fiscus

About this paper

Cite this paper

Fügen, C. et al. (2006). The ISL RT-06S Speech-to-Text System. In: Renals, S., Bengio, S., Fiscus, J.G. (eds) Machine Learning for Multimodal Interaction. MLMI 2006. Lecture Notes in Computer Science, vol 4299. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11965152_36

DOI: https://doi.org/10.1007/11965152_36
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-69267-6
Online ISBN: 978-3-540-69268-3
eBook Packages: Computer Science, Computer Science (R0)
