
Speech recognition in a dialog system: from conventional to deep processing

A case study applied to Spanish

Published in Multimedia Tools and Applications

Abstract

This paper presents an overview of the automatic speech recognition (ASR) module in a spoken dialog system and describes how it has evolved from the conventional GMM-HMM (Gaussian mixture model - hidden Markov model) architecture toward the recent nonlinear DNN-HMM (deep neural network) scheme. GMMs long dominated as the baseline for speech recognition, but in recent years, with the resurgence of artificial neural networks (ANNs), they have been surpassed in most recognition tasks. A notable property of ANN-based acoustic models is that their weights can be adjusted in two training steps: i) initialization of the weights (with or without pre-training) and ii) fine-tuning. To exemplify these frameworks, a case study is carried out with the Kaldi toolkit on a mid-vocabulary, speaker-independent voice corpus for a connected-words phone-dialing task: recognition of digit strings and personal name lists in Mexican Spanish. The results show reasonable accuracy for DNN-based acoustic modeling: a word error rate (WER) of 1.49% is achieved for the context-dependent DNN-HMM, a 30% relative improvement over the best GMM-HMM result in these experiments (2.12% WER).
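The two headline metrics in the abstract, WER and relative improvement, can be sketched as follows. This is a minimal illustration, not the Kaldi scoring implementation: WER is the word-level Levenshtein edit distance (substitutions, deletions, insertions) divided by the number of reference words, and the 30% figure follows from comparing the two reported error rates.

```python
def wer(ref: str, hyp: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    r, h = ref.split(), hyp.split()
    # dp[i][j] = edit distance between the first i ref words and first j hyp words
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i  # i deletions
    for j in range(len(h) + 1):
        dp[0][j] = j  # j insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = dp[i - 1][j - 1] + (r[i - 1] != h[j - 1])  # substitution (or match)
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(r)][len(h)] / len(r)

def relative_improvement(baseline_wer: float, new_wer: float) -> float:
    """Relative WER reduction of the new system over the baseline."""
    return (baseline_wer - new_wer) / baseline_wer

# One substitution in a three-word digit string -> WER of 1/3
print(wer("uno dos tres", "uno seis tres"))

# The paper's numbers: 2.12% (best GMM-HMM) vs 1.49% (CD-DNN-HMM)
print(round(100 * relative_improvement(2.12, 1.49), 1))  # ~29.7, i.e. ~30% relative
```

In practice Kaldi reports WER with its own scoring tools over full test sets; the sketch above only makes the arithmetic behind the reported 30% relative gain explicit.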



Acknowledgments

The first author acknowledges the support given by the Universidad Autónoma de Zacatecas (UAZ) during the years 2014-2017 to complete his PhD studies, as well as the support given by CONACyT during his postgraduate studies.

Author information


Corresponding author

Correspondence to Aldonso Becerra.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.


About this article


Cite this article

Becerra, A., de la Rosa, J.I. & González, E. Speech recognition in a dialog system: from conventional to deep processing. Multimed Tools Appl 77, 15875–15911 (2018). https://doi.org/10.1007/s11042-017-5160-5

