
The IBM Rich Transcription 2007 Speech-to-Text Systems for Lecture Meetings

Conference paper in Multimodal Technologies for Perception of Humans (RT 2007, CLEAR 2007)

Abstract

This paper describes the IBM systems submitted to the NIST Rich Transcription 2007 (RT07) evaluation campaign for the speech-to-text (STT) and speaker-attributed speech-to-text (SASTT) tasks in the lecture meeting domain. Three test conditions are considered: multiple distant microphone (MDM), single distant microphone (SDM), and individual headset microphone (IHM), the latter for the STT task only. The IBM system building process is similar to that employed for the Rich Transcription Spring 2006 evaluation (RT06s), but introduces three technical advances for RT07: (a) improved speaker segmentation; (b) system combination via the ROVER approach, applied over an ensemble of systems, some of which are built by randomized decision tree state-tying; and (c) a very large language model of 152M n-grams, incorporating, among other sources, 525M words of web data, and used in conjunction with a dynamic decoder. These advances reduce STT word error rate (WER) in the MDM condition by 16% relative (8% absolute) over the IBM RT06s system, as measured on 17 lecture meeting segments of the RT06s evaluation test set, selected in this work as development data. In the RT07 evaluation campaign, both the MDM and SDM systems perform competitively on the STT and SASTT tasks: for example, in the MDM condition, the system achieves a 44.3% STT WER on the RT07 evaluation test set, excluding scoring of overlapped speech, and combining the STT transcripts with speaker labels from speaker diarization yields a 52.0% SASTT WER. For the STT IHM condition, the newly developed large language model is employed, but in conjunction with the RT06s IHM acoustic models, which are reused because there was insufficient time to train new models on the additional close-talking microphone data available in RT07. The resulting system therefore achieves modest WERs of 31.7% and 33.4% with manual and automatic segmentation, respectively.
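Of the three advances, the ROVER combination step is concrete enough to sketch in code. The fragment below is a minimal, illustrative Python rendering of ROVER's word-level voting; it assumes the component hypotheses have already been aligned into equal-length slots, whereas the actual algorithm of Fiscus (1997) builds this word transition network by iterative dynamic-programming alignment and can also weight votes with confidence scores. All names here (rover_vote, NULL) are illustrative, not from the paper.

```python
# Minimal sketch of ROVER-style word-level voting over multiple ASR outputs.
# Assumes hypotheses are pre-aligned into equal-length slots; the real ROVER
# constructs this alignment itself and supports confidence-weighted voting.

from collections import Counter

NULL = "@"  # placeholder marking "no word" in an alignment slot


def rover_vote(aligned_hypotheses):
    """Pick the majority word in each alignment slot.

    aligned_hypotheses: list of equal-length word lists, one per system,
    padded with NULL where a system emits nothing in that slot.
    Returns the combined transcript as a list of words.
    """
    n_slots = len(aligned_hypotheses[0])
    assert all(len(h) == n_slots for h in aligned_hypotheses)

    combined = []
    for slot in range(n_slots):
        votes = Counter(h[slot] for h in aligned_hypotheses)
        word, _ = votes.most_common(1)[0]
        if word != NULL:  # a NULL winner means the slot emits nothing
            combined.append(word)
    return combined


if __name__ == "__main__":
    # Three toy system outputs, pre-aligned slot by slot.
    systems = [
        ["the", "lecture", "starts", "@",  "now"],
        ["the", "lecture", "@",      "is", "now"],
        ["a",   "lecture", "starts", "@",  "now"],
    ]
    print(" ".join(rover_vote(systems)))  # -> "the lecture starts now"
```

Incidentally, the abstract's 16% relative and 8% absolute WER reductions are mutually consistent: together they imply a development-set baseline of roughly 0.08 / 0.16 = 50% WER.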




Author information

Huang, J., Marcheret, E., Visweswariah, K., Libal, V., Potamianos, G.

Editor information

Rainer Stiefelhagen, Rachel Bowers, Jonathan Fiscus


Copyright information

© 2008 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Huang, J., Marcheret, E., Visweswariah, K., Libal, V., Potamianos, G. (2008). The IBM Rich Transcription 2007 Speech-to-Text Systems for Lecture Meetings. In: Stiefelhagen, R., Bowers, R., Fiscus, J. (eds) Multimodal Technologies for Perception of Humans. RT 2007, CLEAR 2007. Lecture Notes in Computer Science, vol 4625. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-68585-2_40


  • DOI: https://doi.org/10.1007/978-3-540-68585-2_40

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-68584-5

  • Online ISBN: 978-3-540-68585-2

  • eBook Packages: Computer Science, Computer Science (R0)
