
Training deep neural networks with non-uniform frame-level cost function for automatic speech recognition

Multimedia Tools and Applications

Abstract

This paper presents two new variants of the frame-level cost function for training a deep neural network, with the goal of achieving lower word error rates in speech recognition. The choice of optimization method and of the loss function it minimizes is central to neural network training, and improving them remains a salient research objective; this paper addresses the loss function side. The first proposed framework is based on the concept of extropy, the complementary dual of an uncertainty measure. The conventional cross-entropy function is mapped to a non-uniform loss function based on its corresponding extropy, emphasizing frames whose senone membership is ambiguous. The second proposal fuses this mapped cross-entropy function with the idea of boosted cross-entropy, which emphasizes frames with low target posterior probability. The proposed approaches were evaluated on a custom mid-vocabulary speaker-independent voice corpus, used to recognize digit strings and personal name lists in Spanish from the north-central part of Mexico in a connected-words phone-dialing task. The two proposed approaches yield relative word error rate improvements of \(12.3\%\) and \(10.7\%\), respectively, over the conventional, well-established cross-entropy objective function.
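To make the abstract's description more concrete, the sketch below illustrates one plausible shape of such a non-uniform frame-level objective: a cross-entropy whose per-frame weight is derived from the extropy of the posterior distribution (so ambiguous frames are emphasized), optionally fused with a boosted-cross-entropy style factor that up-weights frames with a low target posterior. This is a minimal illustrative sketch, not the paper's exact formulation; the weighting scheme, the normalization, and the names (`nonuniform_ce`, `boost_power`) are assumptions introduced here for illustration only.

```python
# Illustrative sketch (not the paper's formulation) of a non-uniform
# frame-level cross-entropy: frames are weighted by the extropy of their
# senone posterior, and an optional "boosted" factor emphasizes frames
# with a low target posterior probability.
import numpy as np

def softmax(logits):
    """Row-wise softmax over senone logits, shape (frames, senones)."""
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def extropy(p, eps=1e-12):
    """Extropy J(p) = -sum_i (1 - p_i) * log(1 - p_i), computed per frame.
    Larger values correspond to flatter, more ambiguous posteriors."""
    q = np.clip(1.0 - p, eps, 1.0)
    return -np.sum(q * np.log(q), axis=1)

def nonuniform_ce(logits, targets, boost_power=0.0, eps=1e-12):
    """Frame-weighted cross-entropy (hypothetical illustration).

    logits      : (frames, senones) pre-softmax network outputs
    targets     : (frames,) integer senone labels
    boost_power : > 0 additionally up-weights frames whose target
                  posterior is low, in the spirit of boosted cross-entropy.
    """
    post = softmax(logits)
    idx = np.arange(len(targets))
    p_target = np.clip(post[idx, targets], eps, 1.0)

    # Ambiguity-based weight from the normalized per-frame extropy.
    j = extropy(post)
    w = j / (j.max() + eps)

    if boost_power > 0.0:
        w = w * (1.0 - p_target) ** boost_power

    # Weighted negative log-likelihood, averaged over frames.
    return np.mean(-w * np.log(p_target))

# Example: 4 frames, 5 senones
rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 5))
labels = np.array([0, 3, 1, 4])
print(nonuniform_ce(logits, labels))                    # extropy-weighted CE
print(nonuniform_ce(logits, labels, boost_power=1.0))   # fused "boosted" variant
```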





Acknowledgments

The first author acknowledges the support given by the Universidad Autónoma de Zacatecas (UAZ) during the years 2014-2017 toward his PhD studies. He also acknowledges the support given by CONACyT during his postgraduate studies.

Author information


Corresponding author

Correspondence to Aldonso Becerra.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.


About this article


Cite this article

Becerra, A., de la Rosa, J.I., González, E. et al. Training deep neural networks with non-uniform frame-level cost function for automatic speech recognition. Multimed Tools Appl 77, 27231–27267 (2018). https://doi.org/10.1007/s11042-018-5917-5

