Optimal prosodic feature extraction and classification in parametric excitation source information for Indian language identification using neural network based Q-learning algorithm

International Journal of Speech Technology

Abstract

Automatic language identification (LID) systems are widely used in real-world multilingual speech applications. Speech production depends on the vocal tract, and the associated excitation source information is explored here for the LID task. In this paper, the LID system uses sub-segmental, segmental and supra-segmental features derived from the linear prediction (LP) residual of the speech signal, which represent the excitation source information of speech in various native languages. The glottal flow derivative of the speech signal is obtained through the iterative adaptive inverse filtering method. In addition, prosodic features of the speech signal are extracted using the short-time Fourier transform because of its ability to process non-stationary signals. A deep neural network based Q-learning (DNNQL) algorithm is then employed to identify the class label for a specific language. The proposed approach is validated experimentally on a recorded database of Indian languages. The proposed LID system achieves 97.3% accuracy, performing well compared to other machine learning based approaches.
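As a rough illustration of the linear prediction residual mentioned in the abstract, the following is a minimal sketch and not the authors' implementation. It assumes the autocorrelation method with a Hamming analysis window and a prediction order of 10; the abstract does not state these choices, and the function names `lp_coefficients` and `lp_residual` are hypothetical.

```python
import numpy as np


def lp_coefficients(frame, order):
    """Estimate LP coefficients a[1..order] by the autocorrelation method.

    The predictor is s[n] ~ sum_k a[k] * s[n - k]. The Hamming window is an
    assumption of this sketch, not a detail taken from the paper.
    """
    windowed = frame * np.hamming(len(frame))
    # Autocorrelation of the windowed frame at lags 0..order.
    full = np.correlate(windowed, windowed, mode="full")
    r = full[len(windowed) - 1 : len(windowed) + order]
    # Normal equations R a = r[1:], where R is Toeplitz in r[0..order-1].
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    R += 1e-9 * r[0] * np.eye(order)  # small ridge for numerical stability
    return np.linalg.solve(R, r[1 : order + 1])


def lp_residual(frame, order=10):
    """Return the LP residual: the frame passed through the inverse filter
    A(z) = 1 - sum_k a[k] z^-k, truncated to the frame length."""
    a = lp_coefficients(frame, order)
    inverse_filter = np.concatenate(([1.0], -a))
    return np.convolve(frame, inverse_filter)[: len(frame)]
```

Calling `lp_residual(frame)` on a one-dimensional NumPy array of speech samples returns an array of the same length. In this sketch the residual serves as an approximation of the excitation source signal, from which features at different time scales could then be computed.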

Author information

Corresponding author

Correspondence to Himanish Shekhar Das.

About this article

Cite this article

Das, H.S., Roy, P. Optimal prosodic feature extraction and classification in parametric excitation source information for Indian language identification using neural network based Q-learning algorithm. Int J Speech Technol 22, 67–77 (2019). https://doi.org/10.1007/s10772-018-09582-6

  • DOI: https://doi.org/10.1007/s10772-018-09582-6
