Abstract
Training procedures for deep neural networks remain an area with ample room for research and constant improvement, whether to increase efficiency or reduce training time. One of the lesser-addressed components is the objective function, an underlying aspect to consider when better error rates are needed in automatic speech recognition. The aim of this paper is to present two new variations of the frame-level cost function for training a deep neural network, with the purpose of obtaining lower word error rates in speech recognition, applied to a case study in Spanish. The first proposed function is a fusion of the boosted cross-entropy and the so-called cross-entropy/log-posterior-ratio. The main idea is to jointly emphasize the prediction of difficult/crucial frames through a boosting factor and, at the same time, enlarge the distance between the target senone and its closest competitor. The second proposal is a fusion of the non-uniform mapped cross-entropy and the cross-entropy/log-posterior-ratio. This function uses the mapped function to emphasize frames whose assignment to specific senones is ambiguous, together with the log-posterior-ratio to separate the target senone from the most competing tied triphone state. The proposed approaches are compared against the frame-level cost functions discussed in the state of the art. The comparison was made using a custom mid-vocabulary speaker-independent voice corpus, employed for the recognition of digit strings and personal name lists in Spanish from the north-central part of México on a connected-words phone dialing task. Relative word error rate improvements of 15.14% and 12.30% are obtained with the two proposed approaches, respectively, against the plain well-established cross-entropy loss function.
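To illustrate the kind of frame-level objective the abstract describes, the sketch below combines a standard cross-entropy term with a penalty on the log-posterior ratio between the target senone and its strongest competitor. This is a hypothetical formulation for illustration only: the paper's exact definitions of the fused losses, the boosting factor, and the non-uniform mapping are not given in the abstract, and the weighting parameter `lam` is an assumption.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the senone dimension
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def ce_log_posterior_ratio_loss(logits, target, lam=0.5):
    """Illustrative frame-level loss: cross-entropy plus a hinge on the
    log-posterior ratio between the target senone and its closest
    competitor (hypothetical form; not the paper's exact definition)."""
    p = softmax(logits)
    ce = -np.log(p[target])
    # Strongest competing senone, excluding the target itself
    comp = np.argmax(np.where(np.arange(p.size) == target, -np.inf, p))
    # Positive only when the competitor's posterior exceeds the target's
    ratio = np.log(p[comp]) - np.log(p[target])
    return ce + lam * max(0.0, ratio)

# A frame where senone 0 clearly wins incurs only the cross-entropy term;
# picking a losing senone as target adds the posterior-ratio penalty.
logits = np.array([2.0, 1.0, 0.1])
easy = ce_log_posterior_ratio_loss(logits, target=0)
hard = ce_log_posterior_ratio_loss(logits, target=1)
```

The hinge makes the extra term vanish once the target senone already dominates its closest competitor, so the penalty only acts on confusable frames, which is the intuition behind enlarging the target-versus-competitor margin.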
Becerra, A., Rosa, J.I.d., González, E. et al. A comparative case study of neural network training by using frame-level cost functions for automatic speech recognition purposes in Spanish. Multimed Tools Appl 79, 19669–19715 (2020). https://doi.org/10.1007/s11042-020-08782-0