The IBM Rich Transcription 2007 Speech-to-Text Systems for Lecture Meetings

Huang, Jing; Marcheret, Etienne; Visweswariah, Karthik; Libal, Vit; Potamianos, Gerasimos

doi:10.1007/978-3-540-68585-2_40

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 4625)

Abstract

This paper describes the IBM systems submitted to the NIST Rich Transcription 2007 (RT07) evaluation campaign for the speech-to-text (STT) and speaker-attributed speech-to-text (SASTT) tasks in the lecture meeting domain. Three test conditions are considered: multiple distant microphone (MDM), single distant microphone (SDM), and individual headset microphone (IHM), the last for the STT task only. The IBM system building process is similar to that employed for the STT task in the Rich Transcription Spring 2006 evaluation (RT06s). However, a few technical advances have been introduced for RT07: (a) better speaker segmentation; (b) system combination via the ROVER approach applied over an ensemble of systems, some of which are built by randomized decision tree state-tying; and (c) a very large language model consisting of 152M n-grams, incorporating, among other sources, 525M words of web data, and used in conjunction with a dynamic decoder. These advances reduce STT word error rate (WER) in the MDM condition by 16% relative (8% absolute) over the IBM RT06s system, as measured on 17 lecture meeting segments of the RT06s evaluation test set, selected in this work as development data. In the RT07 evaluation campaign, both the MDM and SDM systems perform competitively on the STT and SASTT tasks. For example, in the MDM condition, a 44.3% STT WER is achieved on the RT07 evaluation test set, excluding scoring of overlapped speech. When the STT transcripts are combined with speaker labels from speaker diarization, the SASTT WER becomes 52.0%. For the STT IHM condition, the newly developed large language model is employed, but in conjunction with the RT06s IHM acoustic models. The latter are reused because there was not enough time to train new models on the additional close-talking microphone data available in RT07. As a result, the IHM system achieves modest WERs of 31.7% and 33.4% with manual and automatic segmentation, respectively.
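For readers unfamiliar with the metric, the relationship between the relative and absolute WER reductions quoted in the abstract can be illustrated with a minimal sketch. It assumes the standard definition of WER as word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words; it is not the NIST scoring tool used in the evaluation, and the function names and example sentences are hypothetical. Note also that an 8% absolute reduction equal to 16% relative implies a baseline near 8 / 0.16 = 50% WER; that baseline is an inference from rounded figures, not a number stated in the abstract.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words.

    Illustrative only: plain whitespace tokenization, no text
    normalization, no handling of overlapped speech.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the first i-1 reference words
    # and the first j hypothesis words (rolling-row dynamic program).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def absolute_reduction(baseline_wer: float, new_wer: float) -> float:
    """Difference in WER, in the same units as the inputs."""
    return baseline_wer - new_wer


def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Reduction expressed as a fraction of the baseline WER."""
    return (baseline_wer - new_wer) / baseline_wer


if __name__ == "__main__":
    # Hypothetical sentences, not taken from the evaluation data.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))

    # Assumed baseline of 50% WER (inferred, see the lead-in above).
    baseline, improved = 0.50, 0.42
    print(absolute_reduction(baseline, improved))
    print(relative_reduction(baseline, improved))
```

The same absolute gain corresponds to a larger relative gain when the baseline WER is lower, which is why evaluation reports commonly quote both figures.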

Author information

Authors and Affiliations

IBM Thomas J. Watson Research Center, Yorktown Heights, NY 10598, U.S.A.
Jing Huang, Etienne Marcheret, Karthik Visweswariah, Vit Libal & Gerasimos Potamianos

Editor information

Rainer Stiefelhagen, Rachel Bowers, Jonathan Fiscus

Cite this paper

Huang, J., Marcheret, E., Visweswariah, K., Libal, V., Potamianos, G. (2008). The IBM Rich Transcription 2007 Speech-to-Text Systems for Lecture Meetings. In: Stiefelhagen, R., Bowers, R., Fiscus, J. (eds) Multimodal Technologies for Perception of Humans. CLEAR 2007, RT 2007. Lecture Notes in Computer Science, vol 4625. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-68585-2_40

DOI: https://doi.org/10.1007/978-3-540-68585-2_40
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-68584-5
Online ISBN: 978-3-540-68585-2
eBook Packages: Computer Science, Computer Science (R0)
