Abstract
This paper describes the IBM systems submitted to the NIST Rich Transcription 2007 (RT07) evaluation campaign for the speech-to-text (STT) and speaker-attributed speech-to-text (SASTT) tasks in the lecture meeting domain. Three test conditions are considered: multiple distant microphone (MDM), single distant microphone (SDM), and individual headset microphone (IHM), the last for the STT task only. The IBM system building process is similar to that employed last year for the Rich Transcription Spring 2006 (RT06s) STT evaluation. However, a few technical advances have been introduced for RT07: (a) better speaker segmentation; (b) system combination via the ROVER approach applied over an ensemble of systems, some of which are built by randomized decision tree state-tying; and (c) development of a very large language model consisting of 152M n-grams, incorporating, among other sources, 525M words of web data, and used in conjunction with a dynamic decoder. These advances reduce STT word error rate (WER) in the MDM condition by 16% relative (8% absolute) over the IBM RT06s system, as measured on 17 lecture meeting segments of the RT06s evaluation test set, selected in this work as development data. In the RT07 evaluation campaign, both the MDM and SDM systems perform competitively on the STT and SASTT tasks. For example, in the MDM condition, a 44.3% STT WER is achieved on the RT07 evaluation test set, excluding scoring of overlapped speech. When the STT transcripts are combined with speaker labels from speaker diarization, the SASTT WER becomes 52.0%. For the STT IHM condition, the newly developed large language model is employed, but in conjunction with the RT06s IHM acoustic models. The latter are reused because there was not enough time to train new models on the additional close-talking microphone data available in RT07. As a result, the IHM system achieves modest WERs of 31.7% and 33.4% when using manual and automatic segmentation, respectively.
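The relation between the relative and absolute WER reductions quoted above can be made concrete with a small calculation. The sketch below is only illustrative: the function name `implied_wers` is hypothetical, and the baseline and improved WERs it derives are implied by the rounded 16% and 8% figures, not values reported in the abstract.

```python
def implied_wers(absolute_reduction: float, relative_reduction: float) -> tuple[float, float]:
    """Return (baseline WER, improved WER), in percent, implied by the two reductions.

    absolute_reduction: WER drop in percentage points (e.g. 8.0).
    relative_reduction: WER drop as a fraction of the baseline (e.g. 0.16).
    Assumes relative_reduction = absolute_reduction / baseline.
    """
    if relative_reduction <= 0:
        raise ValueError("relative_reduction must be positive")
    baseline = absolute_reduction / relative_reduction
    improved = baseline - absolute_reduction
    return baseline, improved


# Figures taken from the abstract; the outputs are approximate because
# the reported reductions are themselves rounded.
baseline, improved = implied_wers(absolute_reduction=8.0, relative_reduction=0.16)
print(f"implied baseline WER ~ {baseline:.1f}%, implied improved WER ~ {improved:.1f}%")
```

Under these assumptions, the development-set MDM WER would fall from roughly 50% to roughly 42%. These are back-of-the-envelope values meant to show how the two ways of reporting a reduction relate, not a substitute for the scores given in the paper itself.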
Copyright information
© 2008 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Huang, J., Marcheret, E., Visweswariah, K., Libal, V., Potamianos, G. (2008). The IBM Rich Transcription 2007 Speech-to-Text Systems for Lecture Meetings. In: Stiefelhagen, R., Bowers, R., Fiscus, J. (eds) Multimodal Technologies for Perception of Humans. RT CLEAR 2007 2007. Lecture Notes in Computer Science, vol 4625. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-68585-2_40
DOI: https://doi.org/10.1007/978-3-540-68585-2_40
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-68584-5
Online ISBN: 978-3-540-68585-2
eBook Packages: Computer Science, Computer Science (R0)