Deep Neural Network Based Continuous Speech Recognition for Serbian Using the Kaldi Toolkit

Popović, Branislav; Ostrogonac, Stevan; Pakoci, Edvin; Jakovljević, Nikša; Delić, Vlado

doi:10.1007/978-3-319-23132-7_23

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9319)

Included in the following conference series:

International Conference on Speech and Computer

Abstract

This paper presents a deep neural network (DNN) based large vocabulary continuous speech recognition (LVCSR) system for Serbian, developed using the open-source Kaldi speech recognition toolkit. The DNNs are initialized using stacked restricted Boltzmann machines (RBMs) and trained with the standard error backpropagation procedure, using cross-entropy as the objective function, to provide posterior probability estimates for the hidden Markov model (HMM) states. Emission densities of HMM states are represented as Gaussian mixture models (GMMs). The recipes were modified to account for the particularities of the Serbian language in order to obtain the best results. A corpus of approximately 90 hours of speech (21,000 utterances) is used for training. Performance is compared on two different sets of utterances between the baseline GMM-HMM system and various DNN configurations.
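
As a rough illustration of the training objective named in the abstract, the following minimal sketch trains a single softmax output layer with cross-entropy so that it produces posterior estimates over HMM states for each feature frame. It is not the paper's system and does not use Kaldi: it omits the hidden layers and the RBM pre-training, and the function names, the toy two-dimensional frames, the state labels, and the learning rate are all assumptions made for the example.

```python
import math
import random


def softmax(logits):
    """Turn raw scores into a probability distribution over states."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(posteriors, target):
    """Cross-entropy for one frame whose correct state index is `target`."""
    return -math.log(posteriors[target])


def state_posteriors(W, b, x):
    """Posterior estimates P(state | frame) from a linear layer plus softmax."""
    logits = [sum(w * v for w, v in zip(W[k], x)) + b[k] for k in range(len(W))]
    return softmax(logits)


def train_softmax_layer(frames, labels, num_states, epochs=50, lr=0.5, seed=0):
    """Stochastic gradient descent on the cross-entropy objective.

    For a softmax output, the gradient of the loss with respect to the
    logits is (posterior - one_hot_target); in a deep network this is the
    error signal that backpropagation would pass to the hidden layers.
    """
    rng = random.Random(seed)
    dim = len(frames[0])
    W = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(num_states)]
    b = [0.0] * num_states
    history = []  # mean cross-entropy per epoch
    for _ in range(epochs):
        total_loss = 0.0
        for x, y in zip(frames, labels):
            p = state_posteriors(W, b, x)
            total_loss += cross_entropy(p, y)
            for k in range(num_states):
                grad = p[k] - (1.0 if k == y else 0.0)
                for j in range(dim):
                    W[k][j] -= lr * grad * x[j]
                b[k] -= lr * grad
        history.append(total_loss / len(frames))
    return W, b, history


# Hypothetical toy data: four 2-dimensional "frames" aligned to two states.
frames = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = [0, 0, 1, 1]
W, b, history = train_softmax_layer(frames, labels, num_states=2)
posteriors = state_posteriors(W, b, [1.0, 0.0])
```

Here `history` holds the mean cross-entropy per epoch and `posteriors` is the estimated distribution over the two toy states for one frame. In a full hybrid system, such per-frame posteriors would be computed for thousands of tied HMM states and passed to the decoder.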

Acknowledgments

The work described in this paper was supported in part by the Ministry of Education, Science and Technological Development of the Republic of Serbia, within the project TR32035: “Development of Dialogue Systems for Serbian and Other South Slavic Languages”.

Author information

Authors and Affiliations

Faculty of Technical Sciences, University of Novi Sad, Novi Sad, Serbia
Branislav Popović, Stevan Ostrogonac, Edvin Pakoci, Nikša Jakovljević & Vlado Delić

Corresponding author

Correspondence to Branislav Popović.

Editor information

Editors and Affiliations

SPIIRAS, Saint-Petersburg, Russia
Andrey Ronzhin
Moscow State Linguistic University, Moscow, Russia
Rodmonga Potapova
University of Patras, Patras, Greece
Nikos Fakotakis

About this paper

Cite this paper

Popović, B., Ostrogonac, S., Pakoci, E., Jakovljević, N., Delić, V. (2015). Deep Neural Network Based Continuous Speech Recognition for Serbian Using the Kaldi Toolkit. In: Ronzhin, A., Potapova, R., Fakotakis, N. (eds) Speech and Computer. SPECOM 2015. Lecture Notes in Computer Science, vol 9319. Springer, Cham. https://doi.org/10.1007/978-3-319-23132-7_23

DOI: https://doi.org/10.1007/978-3-319-23132-7_23
Published: 04 September 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-23131-0
Online ISBN: 978-3-319-23132-7
eBook Packages: Computer Science, Computer Science (R0)
