Abstract
An automatic speech recognition (ASR) system requires a prior segmentation stage that distinguishes speech from non-speech. Additional information, such as “who spoke when”, can be supplied to the ASR system so that it can perform speaker adaptation. This paper studies the influence of automatic speech segmentation and speaker clustering on ASR performance, with the aim of detecting the weak points of the diarization system by analyzing the causes of the different types of recognition errors: insertions, deletions and substitutions. Experiments are run on the Galician broadcast news database Transcrigal, and the results show that the speaker diarization system presented in this work is suitable as a preprocessing step for ASR, as the performance is almost the same as that obtained with manual segmentation and clustering.
© 2012 Springer-Verlag Berlin Heidelberg
Cite this paper
Lopez-Otero, P., Docio-Fernandez, L., Garcia-Mateo, C., Cardenal-Lopez, A. (2012). On the Influence of Automatic Segmentation and Clustering in Automatic Speech Recognition. In: Torre Toledano, D., et al. Advances in Speech and Language Technologies for Iberian Languages. Communications in Computer and Information Science, vol 328. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-35292-8_6
DOI: https://doi.org/10.1007/978-3-642-35292-8_6
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-35291-1
Online ISBN: 978-3-642-35292-8
eBook Packages: Computer Science (R0)