Abstract
This paper presents the implementation of a Czech ASR system under various conditions using the KALDI speech recognition toolkit with two standard state-of-the-art architectures (GMM-HMM and DNN-HMM). We present recipes for building LVCSR systems from the SpeechDat, SPEECON, CZKCC, and NCCCz corpora, together with an updated version of the feature extraction tool CtuCopy that now supports the KALDI format. All presented recipes, as well as the CtuCopy tool, are publicly available under the Apache License v2.0. Finally, we describe an extension of the KALDI toolkit that supports running the described LVCSR recipes on MetaCentrum computing facilities (the Czech National Grid Infrastructure operated by CESNET). The experimental part presents the baseline performance of both GMM-HMM and DNN-HMM LVCSR systems on the given Czech corpora. These results also demonstrate the behaviour of the designed LVCSR systems under various acoustic conditions as well as various speaking styles.
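The recipes themselves are distributed as KALDI shell scripts; purely as a rough illustration, the following minimal Python sketch shows how features stored in the standard KALDI archive format (e.g. by the updated CtuCopy tool or by a recipe's feature-extraction stage) can be inspected. The third-party kaldi_io package and the data/train/feats.scp path are assumptions made for this example, not part of the published recipes.

```python
# Minimal sketch (not part of the published recipes): inspecting features
# stored in the standard KALDI archive format.
# Assumes the third-party 'kaldi_io' Python package and a conventional
# data/train/feats.scp produced by the feature-extraction stage.
import kaldi_io

scp_path = "data/train/feats.scp"  # hypothetical path; depends on the recipe layout

for utt_id, feats in kaldi_io.read_mat_scp(scp_path):
    # 'feats' is a NumPy matrix: one row per frame, one column per coefficient
    print(f"{utt_id}: {feats.shape[0]} frames x {feats.shape[1]} coefficients")
    break  # show only the first utterance
```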
Notes
1. The corpus was collected for TEMIC Speech Dialogue Systems GmbH in Ulm at the Czech Technical University in Prague, in co-operation with the Brno University of Technology and the University of West Bohemia in Plzen.
2. The corpus was collected with a focus on understanding a very informal speaking style, within collaborative research carried out at CTU in Prague and Radboud University Nijmegen.
3. More information can be found in the official KALDI documentation: http://kaldi.sourceforge.net/data_prep.html. A minimal sketch of the corresponding data files is given below.
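As a rough, hypothetical illustration of the data preparation referred to in note 3, the sketch below writes the core KALDI data-directory files (wav.scp, text, utt2spk) for a toy two-utterance corpus. The directory layout, paths, and utterance IDs are invented for the example; the actual recipes derive them from the corpus annotations.

```python
# Hypothetical illustration of the KALDI data-directory files mentioned in note 3.
# The corpus layout and utterance IDs are invented for this example; the real
# recipes derive them from the SpeechDat/SPEECON/CZKCC/NCCCz annotations.
from pathlib import Path

data_dir = Path("data/train_demo")
data_dir.mkdir(parents=True, exist_ok=True)

# (speaker, utterance ID, wav path, transcription) for a toy two-utterance corpus
utterances = [
    ("spk001", "spk001_utt001", "/corpus/spk001/utt001.wav", "dobrý den"),
    ("spk002", "spk002_utt001", "/corpus/spk002/utt001.wav", "na shledanou"),
]

# wav.scp: utterance ID -> audio file; text: utterance ID -> transcription;
# utt2spk: utterance ID -> speaker ID (KALDI derives spk2utt from it).
with open(data_dir / "wav.scp", "w") as wav_scp, \
     open(data_dir / "text", "w") as text, \
     open(data_dir / "utt2spk", "w") as utt2spk:
    for spk, utt, wav, trans in sorted(utterances):
        wav_scp.write(f"{utt} {wav}\n")
        text.write(f"{utt} {trans}\n")
        utt2spk.write(f"{utt} {spk}\n")
```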
Acknowledgments
The research described in this paper was supported by internal CTU grant SGS14/191/OHK3/3T/13 “Advanced Algorithms of Digital Signal Processing and their Applications”. Access to computing and storage facilities owned by parties and projects contributing to the National Grid Infrastructure MetaCentrum, provided under the programme “Projects of Large Research, Development, and Innovations Infrastructures” (CESNET LM2015042), is greatly appreciated.