Performance analysis of ASR system in hybrid DNN-HMM framework using a PWL euclidean activation function

Dutta, Anirban; Ashishkumar, Gudmalwar; Rao, Ch V. Rama

doi:10.1007/s11704-020-9419-z

Performance analysis of ASR system in hybrid DNN-HMM framework using a PWL euclidean activation function

Research Article
Published: 05 June 2021

Volume 15, article number 154705, (2021)
Cite this article

Frontiers of Computer Science Aims and scope Submit manuscript

Anirban Dutta¹,
Gudmalwar Ashishkumar¹ &
Ch V. Rama Rao¹

196 Accesses
7 Citations
1 Altmetric
Explore all metrics

Abstract

Automatic Speech Recognition (ASR) is the process of mapping an acoustic speech signal into a human readable text format. Traditional systems exploit the Acoustic Component of ASR using the Gaussian Mixture Model — Hidden Markov Model (GMM-HMM) approach. Deep Neural Network (DNN) opens up new possibilities to overcome the shortcomings of conventional statistical algorithms. Recent studies modeled the acoustic component of ASR system using DNN in the so called hybrid DNN-HMM approach. In the context of activation functions used to model the non-linearity in DNN, Rectified Linear Units (ReLU) and maxout units are mostly used in ASR systems. This paper concentrates on the acoustic component of a hybrid DNN-HMM system by proposing an efficient activation function for the DNN network. Inspired by previous works, euclidean norm activation function is proposed to model the non-linearity of the DNN network. Such non-linearity is shown to belong to the family of Piecewise Linear (PWL) functions having distinct features. These functions can capture deep hierarchical features of the pattern. The relevance of the proposal is examined in depth both theoretically and experimentally. The performance of the developed ASR system is evaluated in terms of Phone Error Rate (PER) using TIMIT database. Experimental results achieve a relative increase in performance by using the proposed function over conventional activation functions.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Subscribe and save

Springer+ Basic

$34.99 /Month

Get 10 units per month
Download Article/Chapter or eBook
1 Unit = 1 Article or 1 Chapter
Cancel anytime

Buy Now

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

An Analysis of Deep Neural Networks in Broad Phonetic Classes for Noisy Speech Recognition

Automatic Speech Recognition

A comparative study of deep neural network based Punjabi-ASR system

Article 15 December 2018

Discover the latest articles, news and stories from top researchers in related subjects.

Artificial Intelligence

References

Baker J M, Deng L, Glass J, Khudanpur S, Lee C H, Morgan N, O’Shaughnessy D. Developments and directions in speech recognition and understanding, Part 1 [DSP Education]. IEEE Signal Processing Magazine, 2009, 26(3): 75–80
Article Google Scholar
Lawrence R. Fundamentals of Speech Recognition. India: Pearson Education, 2008
Google Scholar
Young S. A review of large vocabulary continuous speech. IEEE Signal Processing Magazine, 1996, 13(5): 45
Article Google Scholar
Hinton G E, Salakhutdinov R R. Reducing the dimensionality of data with neural networks. Science, 2006, 313(5786): 504–507
Article MathSciNet Google Scholar
McDermott E, Hazen T J, Le Roux J, Nakamura A, Katagiri S. Discriminative training for large vocabulary speech recognition using minimum classification error. IEEE Transactions on Audio, Speech and Language Processing, 2006, 15(1): 203–223
Article Google Scholar
Saon G, Chien J T. Large vocabulary continuous speech recognition systems: a look at some recent advances. IEEE Signal Processing Magazine, 2012, 29(6): 18–33
Article Google Scholar
Hinton G E, Osindero S, Teh Y W. A fast learning algorithm for deep belief nets. Neural Computation, 2006, 18(7): 1527–1554
Article MathSciNet Google Scholar
He K, Zhang X, Ren S, Sun J. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2015, 37(9): 1904–1916
Article Google Scholar
Chen K, Ding G, Han J. Attribute based supervised deep learning model for action recognition. Frontiers of Computer Science, 2017, 11(2): 219–229
Article Google Scholar
Graves A, Jaitly N. Towards end to end speech recognition with recurrent neural networks. In: Proceedings of International Conference on Machine Learning. 2014, 1764–1772
Ying W, Zhang L, Deng H. Sichuan dialect speech recognition with deep LSTM network. Frontiers of Computer Science, 2020, 14(2): 378–387
Article Google Scholar
Yan Y, Chen Z, Liu Z. Semi-tensor product of matrices approach to reachability of finite automata with application to language recognition. Frontiers of Computer Science, 2014, 8(6): 948–957
Article MathSciNet Google Scholar
Young T, Hazarika D, Poria S, Cambria E. Recent trends in deep learning based natural language processing. IEEE Computational Intelligence Magazine, 2018, 13(3): 55–75
Article Google Scholar
Dahl G E, Yu D, Deng L, Acero A. Context dependent pre-trained deep neural networks for large vocabulary speech recognition. IEEE Transactions on Audio, Speech and Language Processing, 2011, 20(1): 30–42
Article Google Scholar
Zhang Q, Zhang L. Convolutional adaptive denoising autoencoders for hierarchical feature extraction. Frontiers of Computer Science, 2018, 12(6): 1140–1148
Article Google Scholar
Rong W, Peng B, Ouyang Y, Li C, Xiong Z. Structural information aware deep semi-supervised recurrent neural network for sentiment analysis. Frontiers of Computer Science, 2015, 9(2): 171–184
Article MathSciNet Google Scholar
Peddinti V, Povey D, Khudanpur S. A time delay neural network architecture for efficient modeling of long temporal contexts. In: Proceedings of the 16th Annual Conference of the International Speech Communication Association. 2015
Chan W, Jaitly N, Le Q, Vinyals O. Listen, attend and spell: a neural network for large vocabulary conversational speech recognition. In: Proceedings of 2016 IEEE International Conference on Acoustics, Speech and Signal Processing. 2016, 4960–4964
Bellegarda J R, Monz C. State of the art in statistical methods for language and speech processing. Computer Speech & Language, 2016, 35: 163–184
Article Google Scholar
Glorot X, Bengio Y. Understanding the difficulty of training deep feedforward neural networks. In: Proceedings of the 13th International Conference on Artificial Intelligence and Statistics. 2010, 249–256
Zhao W, San Y. RBF neural network based on q Gaussian function in function approximation. Frontiers of Computer Science in China, 2011, 5(4): 381–386
Article MathSciNet Google Scholar
Wan L, Zeiler M, Zhang S, LeCun Y, Fergus R. Regularization of neural networks using dropconnect. In: Proceedings of International Conference on Machine Learning. 2013, 1058–1066
Hsu W N, Zhang Y, Glass J. Unsupervised domain adaptation for robust speech recognition via variational autoencoder based data augmentation. In: Proceedings of IEEE Automatic Speech Recognition and Understanding Workshop. 2017, 16–23
Nair V, Hinton G E. Rectified linear units improve restricted boltzmann machines. In: Proceedings of the 27th International Conference on Machine Learning. 2010, 807–814
Agostinelli F, Hoffman M, Sadowski P, Baldi P. Learning activation functions to improve deep neural networks. 2014, arXiv preprint arXiv:1412.6830
Springenberg J T, Riedmiller M. Improving deep neural networks with probabilistic maxout units. 2013, arXiv preprint arXiv:1312.6116
Le Q V, Jaitly N, Hinton G E. A simple way to initialize recurrent networks of rectified linear units. 2015, arXiv preprint arXiv:1504.00941
Graves A, Jaitly N, Mohamed A R. Hybrid speech recognition with deep bidirectional LSTM. In: Proceedings of IEEE Workshop on Automatic Speech Recognition and Understanding. 2013, 273–278
Chen J, Zhang Q, Liu P, Qiu X, Huang X J. Implicit discourse relation detection via a deep architecture with gated relevance network. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics. 2016, 1726–1735
Glorot X, Bordes A, Bengio Y. Deep sparse rectifier neural networks. In: Proceedings of the 14th International Conference on Artificial Intelligence and Statistics. 2011, 315–323
Goodfellow I J, Warde-Farley D, Mirza M, Courville A, Bengio Y. Maxout networks. In: Proceedings of International Conference on Machine Learning. 2013, 1319–1327
Maas A L, Hannun A Y, Ng A Y. Rectifier nonlinearities improve neural network acoustic models. In: Proceedings of International Conference on Machine Learning. 2013
Dahl G E, Sainath T N, Hinton G E. Improving deep neural networks for LVCSR using rectified linear units and dropout. In: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing. 2013, 8609–8613
Zhang X, Trmal J, Povey D, Khudanpur S. Improving deep neural network acoustic models using generalized maxout networks. In: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing. 2014, 215–219
Cai M, Liu J. Maxout neurons for deep convolutional and LSTM neural networks in speech recognition. Speech Communication, 2016, 77: 53–64
Article Google Scholar
Aggarwal C C, Hinneburg A, Keim D A. On the surprising behavior of distance metrics in high dimensional space. In: Proceedings of International Conference on Database Theory. 2001, 420–434
Hinton G, Deng L, Yu D, Dahl G E, Mohamed A R, Jaitly N, Senior A, Vanhoucke V, Nguyen P, Sainath T N, Kingsbury B. Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups. IEEE Signal Processing Magazine, 2012, 29(6): 82–97
Article Google Scholar
Gales M, Young S. The application of hidden Markov models in speech recognition. Foundations and Trends in Signal Processing, 2008, 1(3): 195–304
Article Google Scholar
Montufar G, Pascanu R, Cho K, Bengio Y. On the number of linear regions of deep neural networks. In: Proceedings of the 27th International Conference on Neural Information Processing Systems. 2014, 2924–2932
Pascanu R, Montufar G, Bengio Y. On the number of response regions of deep feed forward networks with piece wise linear activations. 2013, arXiv preprint arXiv:1312.6098
Srivastava R K, Masci J, Gomez F, Schmidhuber J. Understanding locally competitive networks. 2014, arXiv preprint arXiv:1410.1165
Arora R, Basu A, Mianjy P, Mukherjee A. Understanding deep neural networks with rectified linear units. In: Proceedings of International Conference on Learning Pepresentation. 2018
Stallings J. The piecewise linear structure of euclidean space. In: Proceedings of the Cambridge Philosophical Society. 1962, 481–488
Amin H, Curtis K M, Hayes-Gill B R. Piecewise linear approximation applied to nonlinear function of a neural network. IEE Proceedings-Circuits, Devices and Systems, 1997, 144(6): 313–317
Article Google Scholar
Serra T, Tjandraatmadja C, Ramalingam S. Bounding and counting linear regions of deep neural networks. In: Proceedings of International Conference on Machine Learning. 2018, 4558–4566
Beyer K, Goldstein J, Ramakrishnan R, Shaft U. When is “nearest neighbor” meaningful? In: Proceedings of International Conference on Database Theory. 1999, 217–235
Gold B, Morgan N, Ellis D. Speech and Audio Signal Processing: Processing and Perception of Speech and Music. John Wiley & Sons, 2011
Rath S P, Povey D, Vesely K, Cernocky J. Improved feature processing for deep neural networks. In: Proceedings of the 14th Annual Conference of the International Speech Communication Association. 2013, 109–113
Povey D, Zhang X, Khudanpur S. Parallel training of DNNs with natural gradient and parameter averaging. 2014, arXiv preprint arXiv:1410.7455
Garofolo J S, Lamel L F, Fisher W M, Fiscus J G, Pallett D S. DARPA TIMIT acoustic phonetic continous speech corpus CD ROM. NIST Speech Disc 1–1.1. NASA STI/Recon Technical Report n, 1993
Hifny Y, Renals S. Speech recognition using augmented conditional random fields. IEEE Transactions on Audio, Speech and Language Processing, 2009, 17(2): 354–365
Article Google Scholar
Povey D, Ghoshal A, Boulianne G, Burget L, Glembek O, Goel N, Hannemann M, Motlicek P, Qian Y, Schwarz P, Silovsky J. The Kaldi speech recognition toolkit. In: Proceedings of IEEE Workshop on Automatic Speech Recognition and Understanding. 2011
Goodfellow I, Bengio Y, Courville A. Deep Learning. MIT Press, 2016
LeCun Y, Bengio Y, Hinton G. Deep learning. Nature, 2015, 521(7553): 436–444
Article Google Scholar
Huang P S, Avron H, Sainath T N, Sindhwani V, Ramabhadran B. Kernel methods match deep neural networks on timit. In: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 2014, 205–209

Download references

Acknowledgements

This work was an outcome of the R&D work undertaken project under the Visvesvaraya PhD Scheme of Ministry of Electronics & Information Technology, Government of India, being implemented by Digital India Corporation. We are thankful to Electronics and Communication Engineering Department, National Institute of Technology Meghalaya for giving us the opportunity to use the necessary equipments required to conduct the research.

Author information

Authors and Affiliations

Department of Electronics and Communication Engineering, National Institute of Technology Meghalaya, Shillong, 793003, India
Anirban Dutta, Gudmalwar Ashishkumar & Ch V. Rama Rao

Authors

Anirban Dutta
View author publications
You can also search for this author inPubMed Google Scholar
Gudmalwar Ashishkumar
View author publications
You can also search for this author inPubMed Google Scholar
Ch V. Rama Rao
View author publications
You can also search for this author inPubMed Google Scholar

Corresponding author

Correspondence to Anirban Dutta.

Additional information

Anirban Dutta is currently a PhD scholar in the Department of Electronics and Communication Engineering, National Institute of Technology Meghalaya, India. During his PhD tenure he has made significant contribution in speech processing, neural networks and its applications. His domain of research includes automatic speech recognition, pattern recognition and deep neural networks.

Gudmalwar Ashishkumar received the BTech degree in Electronics and Communication Engineering from SRTMU, India in 2015, and MTech degree in VLSI design from the National Institute of Technology Meghalaya, India in 2017, where he is currently pursuing PhD degree with the Department of Electronics and communication Engineering. His research interests include speech processing, speech emotion recognition and signal processing.

Ch V Rama Rao received the PhD degree from JNTU Hyderabad, India. He is currently an assistant professor in the Department of Electronics and Communication Engineering at National Institute of Technology Meghalaya, India. During his PhD, he has made significant contributions in the field of speech enhancement. His research interest include speech technology, pattern recognition, statistical signal processing, signal processing issues in advanced communication systems and design and development of advanced wireless communication systems.

Electronic supplementary material