Abstract
We began development of a Czech telephone acoustic model by evaluating various Kaldi recipes on a 500-hour Czech telephone Switchboard-like corpus. On a held-out set from the corpus, the best-performing model was the Time-Delay Neural Network (TDNN) variant “d” with i-vector adaptation. The TDNN architecture with an asymmetric time-delay window also satisfied our real-time application constraint. However, the model failed completely on a real call center task. The main problem lay in the i-vector estimation procedure: the training data are split into short utterances, and the recipe groups them into two-utterance pseudo-speakers for which the i-vectors are estimated. Real call center utterances, in contrast, are much longer, on the order of several minutes or more, so the TDNN model was trained on i-vectors that did not match those seen at test time. We propose two ways to normalize the statistics used for i-vector estimation; with this normalization, the test-data i-vectors are more compatible with the training-data i-vectors. We also discuss several additional ways of improving the model's accuracy on the out-of-domain real task, including the use of LSTM-based models.
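The mismatch described above arises because i-vectors are estimated from accumulated Baum-Welch statistics, whose magnitude grows with utterance length. One plausible way to make long test utterances resemble the short training pseudo-speakers is to rescale the statistics to a fixed effective frame count before extraction. The sketch below illustrates this idea only; the function name, the target-frame parameter, and the choice of a single global scale factor are assumptions for illustration, not the paper's actual two methods, which are not specified in the abstract.

```python
import numpy as np

def normalize_ivector_stats(n_c, f_c, target_frames):
    """Rescale Baum-Welch statistics to a fixed effective frame count.

    n_c: (C,) zeroth-order statistics (soft frame counts per UBM component)
    f_c: (C, D) first-order statistics (posterior-weighted feature sums)
    target_frames: effective utterance length (in frames) matching the
        training condition, e.g. the average pseudo-speaker duration

    Both statistics are scaled by the same factor, so component
    proportions (and hence the implied posterior distribution) are
    preserved while the overall "amount of evidence" is capped.
    """
    total = float(n_c.sum())
    if total <= 0.0:
        return n_c, f_c  # empty utterance: nothing to normalize
    scale = target_frames / total
    return n_c * scale, f_c * scale

# Example: a long utterance (40 accumulated frames of soft counts)
# rescaled to an assumed 20-frame training condition.
n = np.array([10.0, 30.0])
f = np.full((2, 3), 5.0)
n_norm, f_norm = normalize_ivector_stats(n, f, 20.0)
```

Because the i-vector posterior sharpens as statistics accumulate, capping the effective frame count keeps the estimator in the same operating regime it saw during training, at the cost of discarding some of the long utterance's evidence.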
Acknowledgement
This research was supported by the LINDAT/CLARIN, project of the Ministry of Education of the Czech Republic No. CZ.02.1.01/0.0/0.0/16_013/0001781.
Copyright information
© 2019 Springer Nature Switzerland AG
Cite this paper
Vaněk, J., Michálek, J., Psutka, J. (2019). Tuning of Acoustic Modeling and Adaptation Technique for a Real Speech Recognition Task. In: Martín-Vide, C., Purver, M., Pollak, S. (eds) Statistical Language and Speech Processing. SLSP 2019. Lecture Notes in Computer Science(), vol 11816. Springer, Cham. https://doi.org/10.1007/978-3-030-31372-2_20
Print ISBN: 978-3-030-31371-5
Online ISBN: 978-3-030-31372-2
eBook Packages: Computer Science (R0)