
A comparative case study of neural network training by using frame-level cost functions for automatic speech recognition purposes in Spanish

Published in Multimedia Tools and Applications

Abstract

Training procedures for deep neural networks remain an area with ample research opportunities and constant improvement, whether to increase recognition accuracy or training-time performance. One of the lesser-addressed components is the objective function, an underlying aspect to consider when better error rates are needed in automatic speech recognition. The aim of this paper is to present two new variations of the frame-level cost function for training a deep neural network, with the purpose of obtaining lower word error rates in speech recognition applied to a case study in Spanish. The first proposed function is a fusion of the boosted cross-entropy and the so-called cross-entropy/log-posterior-ratio. The main idea is to jointly emphasize the prediction of difficult/crucial frames through a boosting factor and, at the same time, enlarge the distance between the target senone and its closest competitor. The second proposal is a fusion of the non-uniform mapped cross-entropy and the cross-entropy/log-posterior-ratio. This function uses the mapped function to emphasize frames whose assignment to specific senones is ambiguous, and the log-posterior-ratio to separate the target senone from the most competitive tied tri-phone state. The proposed approaches are compared against the frame-level cost functions discussed in the state of the art. The comparison is made using a personalized mid-vocabulary, speaker-independent voice corpus, employed for the recognition of digit strings and personal-name lists in Spanish from the north-central part of México on a connected-words phone-dialing task. Relative word error rate improvements of 15.14% and 12.30% are obtained with the two proposed approaches, respectively, against the plain, well-established cross-entropy loss function.
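The general shape of such a fused frame-level loss can be illustrated with a short sketch. The exact formulations, hyper-parameters, and combination weights used in the paper are not reproduced here; the boosting exponent `gamma`, the mixing weight `lam`, and the additive fusion below are assumed for illustration only. The sketch combines a boosting factor that up-weights difficult frames with a log-posterior-ratio margin between the target senone and its closest competitor:

```python
import numpy as np

def fused_frame_loss(log_probs, target, gamma=2.0, lam=0.1):
    """Illustrative frame-level loss: boosted cross-entropy fused with a
    log-posterior-ratio margin. The fusion form and the hyper-parameters
    gamma and lam are assumptions, not the paper's exact formulation.

    log_probs: 1-D array of log posteriors over senones for one frame
    target:    index of the target senone for this frame
    """
    ce = -log_probs[target]                 # plain cross-entropy for the frame
    p_target = np.exp(log_probs[target])
    boost = (1.0 - p_target) ** gamma       # emphasizes difficult/crucial frames
    # closest competitor: highest posterior among the non-target senones
    competitors = np.delete(log_probs, target)
    ratio = log_probs[target] - competitors.max()  # log-posterior-ratio margin
    return boost * ce - lam * ratio         # assumed additive fusion

# toy usage: 4 senones, frame scored with a log-softmax over raw scores
scores = np.array([2.0, 0.5, -1.0, 0.1])
log_probs = scores - np.log(np.sum(np.exp(scores)))
easy = fused_frame_loss(log_probs, target=0)  # confident, correct senone
hard = fused_frame_loss(log_probs, target=2)  # low-posterior target senone
```

A frame whose target senone already dominates the posterior contributes little (small boost, positive margin), while an ambiguous or wrongly classified frame is both up-weighted and penalized for its negative margin, which is the joint effect the proposed functions aim for.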




Author information

Corresponding author

Correspondence to Aldonso Becerra.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article

Cite this article

Becerra, A., Rosa, J.I.d., González, E. et al. A comparative case study of neural network training by using frame-level cost functions for automatic speech recognition purposes in Spanish. Multimed Tools Appl 79, 19669–19715 (2020). https://doi.org/10.1007/s11042-020-08782-0

