On the Influence of Automatic Segmentation and Clustering in Automatic Speech Recognition

  • Conference paper

Part of the book series: Communications in Computer and Information Science (CCIS, volume 328)

Abstract

An automatic speech recognition (ASR) system needs a prior segmentation stage that separates speech from non-speech. Additional information, such as “who spoke when”, can also be provided to the ASR system so that it can perform speaker adaptation. This paper studies the influence of automatic speech segmentation and speaker clustering on ASR performance, with the aim of detecting the weak points of the diarization system by analyzing what causes each type of recognition error: insertions, deletions and substitutions. Experiments are run on the Galician broadcast news database Transcrigal. The results show that the speaker diarization system presented in this work is suitable as a preprocessing step for ASR, since its performance is almost the same as that obtained with manual segmentation and clustering.
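
As an illustration of how recognition errors are usually broken down, the following minimal sketch aligns a reference transcript with a recognizer hypothesis by word-level edit distance and counts substitutions, deletions (reference words missing from the hypothesis) and insertions. It is not the scoring procedure used in the paper; the function names `count_errors` and `word_error_rate` are hypothetical, and when several minimum-cost alignments exist the backtrace simply prefers the diagonal step.

```python
from typing import List, Tuple


def count_errors(reference: List[str], hypothesis: List[str]) -> Tuple[int, int, int]:
    """Return (substitutions, deletions, insertions) from one minimum-edit alignment."""
    n, m = len(reference), len(hypothesis)
    # cost[i][j] = edit distance between reference[:i] and hypothesis[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i  # delete every reference word
    for j in range(1, m + 1):
        cost[0][j] = j  # insert every hypothesis word
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            mismatch = int(reference[i - 1] != hypothesis[j - 1])
            cost[i][j] = min(
                cost[i - 1][j - 1] + mismatch,  # match or substitution
                cost[i - 1][j] + 1,             # deletion
                cost[i][j - 1] + 1,             # insertion
            )

    # Walk back from the bottom-right corner, counting each operation.
    subs = dels = ins = 0
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            mismatch = int(reference[i - 1] != hypothesis[j - 1])
            if cost[i][j] == cost[i - 1][j - 1] + mismatch:
                subs += mismatch
                i -= 1
                j -= 1
                continue
        if i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            dels += 1
            i -= 1
        else:
            ins += 1
            j -= 1
    return subs, dels, ins


def word_error_rate(reference: List[str], hypothesis: List[str]) -> float:
    """WER = (S + D + I) / number of reference words."""
    if not reference:
        raise ValueError("reference must contain at least one word")
    subs, dels, ins = count_errors(reference, hypothesis)
    return (subs + dels + ins) / len(reference)
```

In a diarization study of this kind, one would expect segmentation mistakes to show up mainly in the deletion and insertion counts (speech discarded as non-speech, or non-speech passed to the recognizer), which is why the per-type breakdown is more informative than the total error rate alone.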

Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Lopez-Otero, P., Docio-Fernandez, L., Garcia-Mateo, C., Cardenal-Lopez, A. (2012). On the Influence of Automatic Segmentation and Clustering in Automatic Speech Recognition. In: Torre Toledano, D., et al. Advances in Speech and Language Technologies for Iberian Languages. Communications in Computer and Information Science, vol 328. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-35292-8_6

  • DOI: https://doi.org/10.1007/978-3-642-35292-8_6

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-35291-1

  • Online ISBN: 978-3-642-35292-8

  • eBook Packages: Computer Science, Computer Science (R0)
