Abstract
Long Short-Term Memory recurrent neural networks (LSTM RNNs) are widely used in Automatic Speech Recognition (ASR) and largely overcome the vanishing-gradient problem. A Bidirectional LSTM (BLSTM) processes the input in both directions, drawing on past as well as future context, and therefore performs well. However, BLSTMs are difficult to deploy because of their high computational requirements, and the vanishing-gradient problem still persists when multiple LSTM layers are stacked. The large size of LSTM networks also leaves them vulnerable to overfitting. The Gated Recurrent Unit (GRU) is a more recent recurrent unit with only two gates: the update gate plays a role similar to the combined forget and input gates of the LSTM, while the reset gate decides how much of the previous state to retain. Because a GRU network is smaller, it is less prone to overfitting and trains faster than an LSTM. The proposed work has a two-fold architecture. In the first stage, the reset and update gates of the GRU are merged into a Single Gated Unit (SGU). The SGU uses about half as many parameters as an LSTM and one third fewer than a GRU, which increases its training speed. In the second stage, the SGU is combined with a deep bidirectional design (DBSGU) to build a hybrid acoustic model that needs fewer parameters and improves learning capability. The proposed model is compared, in terms of similarities and differences, with Deep Bidirectional GRU (DBGRU) and Deep Bidirectional LSTM (DBLSTM) baselines, and achieves a 2 to 4% reduction in Word Error Rate (WER) while learning about 30% faster. The entire work is evaluated on the Crowd-Sourced high-quality Multi-Speaker speech (CSMS) dataset.
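The abstract does not give the SGU equations. As one plausible reading, the sketch below follows the standard GRU update with the reset and update gates merged into a single gate f_t, in the spirit of the minimal gated unit of Zhou et al. (2016); the names SGUCell, step, and bidirectional_pass are illustrative, not the authors' implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class SGUCell:
    """Hypothetical Single Gated Unit: a GRU whose reset and update
    gates are merged into one gate f_t (an assumption, not the
    paper's published equations)."""

    def __init__(self, input_size, hidden_size, seed=0):
        rng = np.random.default_rng(seed)
        s = 1.0 / np.sqrt(hidden_size)
        # Two weight blocks instead of the LSTM's four or the GRU's
        # three: the source of the claimed parameter savings
        # (half of an LSTM, one third fewer than a GRU).
        self.W_f = rng.uniform(-s, s, (hidden_size, input_size))
        self.U_f = rng.uniform(-s, s, (hidden_size, hidden_size))
        self.b_f = np.zeros(hidden_size)
        self.W_h = rng.uniform(-s, s, (hidden_size, input_size))
        self.U_h = rng.uniform(-s, s, (hidden_size, hidden_size))
        self.b_h = np.zeros(hidden_size)

    def step(self, x_t, h_prev):
        # The single gate f_t plays both roles: it resets h_prev
        # inside the candidate state and interpolates between
        # h_prev and the candidate.
        f_t = sigmoid(self.W_f @ x_t + self.U_f @ h_prev + self.b_f)
        h_tilde = np.tanh(self.W_h @ x_t + self.U_h @ (f_t * h_prev) + self.b_h)
        return (1.0 - f_t) * h_prev + f_t * h_tilde

def bidirectional_pass(cell_fwd, cell_bwd, xs, hidden_size):
    """One bidirectional SGU layer: run the sequence in both
    directions and concatenate the per-frame states (the 'B' in
    DBSGU); stacking such layers gives the deep model."""
    h_f = np.zeros(hidden_size)
    h_b = np.zeros(hidden_size)
    fwd, bwd = [], []
    for x in xs:
        h_f = cell_fwd.step(x, h_f)
        fwd.append(h_f)
    for x in reversed(xs):
        h_b = cell_bwd.step(x, h_b)
        bwd.append(h_b)
    return [np.concatenate([f, b]) for f, b in zip(fwd, reversed(bwd))]
```

Under this reading, merging the two gates removes one full set of input and recurrent weight matrices per cell, which is consistent with the parameter and training-speed savings claimed above.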