Abstract
This paper presents the implementation of a Czech ASR system under various conditions using the KALDI speech recognition toolkit with two standard state-of-the-art architectures (GMM-HMM and DNN-HMM). We present recipes for building LVCSR systems from the SpeechDat, SPEECON, CZKCC, and NCCCz corpora, together with an updated version of the feature extraction tool CtuCopy that now supports the KALDI format. All presented recipes, as well as the CtuCopy tool, are publicly available under the Apache License v2.0. Finally, we describe an extension of the KALDI toolkit that supports running the described LVCSR recipes on MetaCentrum computing facilities (the Czech National Grid Infrastructure operated by CESNET). The experimental part presents the baseline performance of both GMM-HMM and DNN-HMM LVCSR systems on the given Czech corpora. These results also demonstrate the behaviour of the designed LVCSR systems under various acoustic conditions as well as various speaking styles.
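The recipes themselves are distributed as KALDI shell scripts; purely as a rough illustration, the following minimal Python sketch shows how features stored in the standard KALDI archive format (e.g. by the updated CtuCopy tool or by a recipe's feature-extraction stage) can be inspected. The third-party kaldi_io package and the data/train/feats.scp path are assumptions made for this example, not part of the published recipes.

```python
# Minimal sketch (not part of the published recipes): inspecting features
# stored in the standard KALDI archive format.
# Assumes the third-party 'kaldi_io' Python package and a conventional
# data/train/feats.scp produced by the feature-extraction stage.
import kaldi_io

scp_path = "data/train/feats.scp"  # hypothetical path; depends on the recipe layout

for utt_id, feats in kaldi_io.read_mat_scp(scp_path):
    # 'feats' is a NumPy matrix: one row per frame, one column per coefficient
    print(f"{utt_id}: {feats.shape[0]} frames x {feats.shape[1]} coefficients")
    break  # show only the first utterance
```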
Notes
1. The corpus was collected for TEMIC Speech Dialogue Systems GmbH in Ulm at the Czech Technical University in Prague, in co-operation with the Brno University of Technology and the University of West Bohemia in Plzen.
2. The corpus was collected with a focus on understanding a very informal speaking style, within collaborative research carried out at CTU in Prague and Radboud University Nijmegen.
3. More information can be found in the official KALDI documentation: http://kaldi.sourceforge.net/data_prep.html. A minimal sketch of the corresponding data files is given below.
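As a rough, hypothetical illustration of the data preparation referred to in note 3, the sketch below writes the core KALDI data-directory files (wav.scp, text, utt2spk) for a toy two-utterance corpus. The directory layout, paths, and utterance IDs are invented for the example; the actual recipes derive them from the corpus annotations.

```python
# Hypothetical illustration of the KALDI data-directory files mentioned in note 3.
# The corpus layout and utterance IDs are invented for this example; the real
# recipes derive them from the SpeechDat/SPEECON/CZKCC/NCCCz annotations.
from pathlib import Path

data_dir = Path("data/train_demo")
data_dir.mkdir(parents=True, exist_ok=True)

# (speaker, utterance ID, wav path, transcription) for a toy two-utterance corpus
utterances = [
    ("spk001", "spk001_utt001", "/corpus/spk001/utt001.wav", "dobrý den"),
    ("spk002", "spk002_utt001", "/corpus/spk002/utt001.wav", "na shledanou"),
]

# wav.scp: utterance ID -> audio file; text: utterance ID -> transcription;
# utt2spk: utterance ID -> speaker ID (KALDI derives spk2utt from it).
with open(data_dir / "wav.scp", "w") as wav_scp, \
     open(data_dir / "text", "w") as text, \
     open(data_dir / "utt2spk", "w") as utt2spk:
    for spk, utt, wav, trans in sorted(utterances):
        wav_scp.write(f"{utt} {wav}\n")
        text.write(f"{utt} {trans}\n")
        utt2spk.write(f"{utt} {spk}\n")
```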
Acknowledgments
The research described in this paper was supported by internal CTU grant SGS14/191/OHK3/3T/13 “Advanced Algorithms of Digital Signal Processing and their Applications”. Access to computing and storage facilities owned by parties and projects contributing to the National Grid Infrastructure MetaCentrum, provided under the programme “Projects of Large Research, Development, and Innovations Infrastructures” (CESNET LM2015042), is greatly appreciated.