On the Influence of Automatic Segmentation and Clustering in Automatic Speech Recognition

Lopez-Otero, Paula; Docio-Fernandez, Laura; Garcia-Mateo, Carmen; Cardenal-Lopez, Antonio

doi:10.1007/978-3-642-35292-8_6

Part of the book series: Communications in Computer and Information Science (CCIS, volume 328)

Abstract

An automatic speech recognition (ASR) system requires a prior segmentation stage that separates speech from non-speech. Additional information, such as “who spoke when”, can also be provided to the ASR system so that it can perform speaker adaptation. This paper studies the influence of automatic speech segmentation and speaker clustering on ASR performance, with the aim of identifying the weak points of the diarization system by analyzing the causes of the different types of recognition errors: insertions, deletions and substitutions. Experiments are run on the Galician broadcast news database Transcrigal, and the results show that the speaker diarization system presented in this work is suitable as a preliminary step to ASR, since its performance is almost the same as that obtained with manual segmentation and clustering.
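The three error types analyzed here (insertions, deletions, also called suppressions, and substitutions) come from aligning the recognized word sequence against a reference transcription. The Python sketch below is an illustrative assumption, not the scoring procedure used in the paper: it performs a unit-cost minimum-edit-distance alignment and counts each error type. The names `align_counts` and `ErrorCounts` are hypothetical. Standard scoring tools such as NIST's sclite use their own alignment weights, so counts may differ from this sketch when several alignments tie.

```python
from dataclasses import dataclass


@dataclass
class ErrorCounts:
    substitutions: int
    deletions: int
    insertions: int
    reference_length: int

    @property
    def wer(self) -> float:
        # Word error rate: total errors divided by the number of reference words.
        errors = self.substitutions + self.deletions + self.insertions
        return errors / self.reference_length if self.reference_length else 0.0


def align_counts(reference: list[str], hypothesis: list[str]) -> ErrorCounts:
    """Unit-cost minimum-edit-distance alignment, returning S/D/I counts.

    Each cell stores (total_edits, subs, dels, ins); tuples compare
    lexicographically, so ties on total edits prefer fewer substitutions.
    """
    n, m = len(reference), len(hypothesis)
    cost = [[(0, 0, 0, 0)] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = (i, 0, i, 0)  # every reference word deleted
    for j in range(1, m + 1):
        cost[0][j] = (j, 0, 0, j)  # every hypothesis word inserted
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            t, s, d, ins = cost[i - 1][j - 1]
            if reference[i - 1] == hypothesis[j - 1]:
                diag = (t, s, d, ins)  # correct word
            else:
                diag = (t + 1, s + 1, d, ins)  # substitution
            t, s, d, ins = cost[i - 1][j]
            up = (t + 1, s, d + 1, ins)  # deletion of a reference word
            t, s, d, ins = cost[i][j - 1]
            left = (t + 1, s, d, ins + 1)  # insertion of a hypothesis word
            cost[i][j] = min(diag, up, left)
    _, s, d, ins = cost[n][m]
    return ErrorCounts(s, d, ins, n)


if __name__ == "__main__":
    counts = align_counts("the cat sat on the mat".split(),
                          "the cat sat on mat now".split())
    print(counts, round(counts.wer, 3))
```

For this example the sketch reports one deletion ("the") and one insertion ("now"), giving a WER of 2/6. Segmentation errors typically show up in such counts as bursts of insertions (non-speech decoded as words) or deletions (speech discarded as non-speech), which is the kind of breakdown the paper uses to locate weaknesses in the diarization stage.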

Author information

Authors and Affiliations

Multimedia Technologies Group (GTM), AtlantTIC Research Center, Universidade de Vigo, E.E. Telecomunicación, 36310, Vigo, Spain
Paula Lopez-Otero, Laura Docio-Fernandez, Carmen Garcia-Mateo & Antonio Cardenal-Lopez

Editor information

Editors and Affiliations

Escuela Politécnica Superior, Universidad Autónoma de Madrid, C/ Francisco Tomás y Valiente 11, 28049, Madrid, Spain
Doroteo Torre Toledano
Centro Politécnico Superior, Edificio Ada Byron, C/ María de Luna nº 1, 50018, Zaragoza, Spain
Alfonso Ortega Giménez
Universidade de Aveiro, Campus Universitário Aveiro, 3810-193, Aveiro, Portugal
António Teixeira
Escuela Politécnica Superior, Universidad Autónoma de Madrid, C/ Francisco Tomás y Valiente 11, 28049, Madrid, Spain
Joaquín González Rodríguez
E.T.S.I. Telecomunicación, Universidad Politécnica de Madrid, Ciudad Universitaria s/n, 28040, Madrid, Spain
Luis Hernández Gómez & Rubén San Segundo Hernández
Escuela Politécnica Superior, Universidad Autónoma de Madrid, C/ Francisco Tomás y Valiente 11, 28049, Madrid, Spain
Daniel Ramos Castro

About this paper

Cite this paper

Lopez-Otero, P., Docio-Fernandez, L., Garcia-Mateo, C., Cardenal-Lopez, A. (2012). On the Influence of Automatic Segmentation and Clustering in Automatic Speech Recognition. In: Torre Toledano, D., et al. Advances in Speech and Language Technologies for Iberian Languages. Communications in Computer and Information Science, vol 328. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-35292-8_6

DOI: https://doi.org/10.1007/978-3-642-35292-8_6
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-35291-1
Online ISBN: 978-3-642-35292-8
eBook Packages: Computer Science, Computer Science (R0)