Abstract
This paper demonstrates the effect of incorporating deep neural network (DNN) techniques into speech recognition systems. A continuous speech recognition system for the Punjabi language is implemented with hybrid DNN acoustic models on the Kaldi toolkit. Recognition performance improves markedly with DNNs, and Karel's DNN recipe gives better recognition performance than Dan's DNN recipe. Of the MFCC and PLP features, MFCC gives better results. The triphone model yields a lower word error rate than the monophone model, and a trigram language model yields a lower word error rate than a bigram model on the Kaldi toolkit for the continuous Punjabi speech recognition system.
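All of the comparisons above (DNN vs. GMM, MFCC vs. PLP, triphone vs. monophone, trigram vs. bigram) are reported in terms of word error rate (WER), the standard ASR metric that Kaldi's scoring scripts compute: the minimum number of word substitutions, deletions, and insertions needed to turn the hypothesis into the reference, divided by the number of reference words. A minimal, self-contained sketch of this computation (not the paper's or Kaldi's actual scoring code) is:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: Levenshtein distance over words / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = min edits to turn the first i reference words
    # into the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = d[i - 1][j] + 1
            insertion = d[i][j - 1] + 1
            d[i][j] = min(substitution, deletion, insertion)
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution ("b" -> "x") and one deletion ("d") over 4 reference
# words gives WER = 2/4 = 0.5.
print(wer("a b c d", "a x c"))  # -> 0.5
```

In practice Kaldi reports this via its `compute-wer` tool over decoded lattices; the sketch above only illustrates the metric itself.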
Guglani, J., Mishra, A.N. DNN based continuous speech recognition system of Punjabi language on Kaldi toolkit. Int J Speech Technol 24, 41–45 (2021). https://doi.org/10.1007/s10772-020-09717-8