
A comparative study of deep neural network based Punjabi-ASR system

International Journal of Speech Technology

Abstract

For the last five decades, the hidden Markov model (HMM) has been regarded as the leading approach to handling temporal variability in an input speech signal when building an automatic speech recognition (ASR) system. The Gaussian mixture model (GMM) became an integral part of the HMM as the means of scoring each state against the information stored in a short windowed frame: it models the distribution of the frame coefficients and supplies the state-level posterior probabilities that act as the HMM's output. In this paper, a deep neural network (DNN) is tested against the GMM; its many hidden layers allow the DNN to exploit a large training dataset before overfitting degrades its performance. Implementing the DNN with a robust feature extraction approach yields a high performance margin in a Punjabi speech recognition system. For feature extraction, the baseline MFCC and GFCC approaches are integrated with cepstral mean and variance normalization (CMVN). Dimensionality reduction, decorrelation of the feature vectors, and speaker variability are then addressed with linear discriminant analysis (LDA), maximum likelihood linear transformation (MLLT), speaker adaptive training (SAT), and maximum likelihood linear regression (MLLR) adaptation models. Two hybrid classifiers, GMM–HMM and DNN–HMM, are evaluated on the resulting acoustic feature vectors over connected and continuous Punjabi speech corpora. Experiments show a notable improvement of 4–5% on the connected dataset and 1–3% on the continuous dataset.
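The pipeline the abstract describes combines CMVN-normalized cepstral features with a DNN that replaces the GMM as the HMM's emission model. As a minimal sketch of two of those steps (not the authors' actual toolkit setup), the snippet below computes per-utterance MFCC+CMVN features and converts DNN state posteriors into the scaled likelihoods an HMM decoder consumes; it assumes librosa and NumPy are available, and the file path, prior vector, and function names are illustrative.

```python
import librosa
import numpy as np

def mfcc_cmvn(wav_path, sr=16000, n_mfcc=13):
    """MFCCs with per-utterance cepstral mean and variance normalization."""
    y, sr = librosa.load(wav_path, sr=sr)
    feats = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T  # (frames, coeffs)
    # CMVN: zero mean, unit variance per coefficient over the utterance.
    return (feats - feats.mean(axis=0)) / (feats.std(axis=0) + 1e-8)

def scaled_log_likelihoods(posteriors, priors):
    """Standard hybrid DNN-HMM conversion: the network outputs p(state | frame),
    but HMM decoding needs p(frame | state); dividing by the state priors gives
    a quantity proportional to it (a subtraction in the log domain)."""
    return np.log(posteriors + 1e-10) - np.log(priors + 1e-10)

# Hypothetical usage: 'utt1.wav' and the uniform priors are placeholders.
feats = mfcc_cmvn("utt1.wav")                       # (frames, 13)
fake_posteriors = np.full((len(feats), 100), 0.01)  # stand-in for DNN output
loglik = scaled_log_likelihoods(fake_posteriors, np.full(100, 0.01))
```

GFCC features would pass through the same normalization; the LDA, MLLT, SAT, and MLLR transforms the abstract lists are then estimated on top of these normalized vectors before GMM–HMM or DNN–HMM training.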



Acknowledgements

This work was partially tested on a sample Punjabi corpus collected for the Language Resources for Auditory Impaired Persons project from IEEE SIGHT. The views and results in this work reflect the perspective of the authors. The authors would like to thank the Speech and Multimodal Laboratory members Mandeep, Sashi, and Nikhil at Chitkara University, Punjab. Special thanks to Dr. Syed, who provided valuable input on the formation of the baseline DNN system.

Author information


Corresponding author

Correspondence to Amitoj Singh.


About this article


Cite this article

Kadyan, V., Mantri, A., Aggarwal, R.K. et al. A comparative study of deep neural network based Punjabi-ASR system. Int J Speech Technol 22, 111–119 (2019). https://doi.org/10.1007/s10772-018-09577-3

