
Combining evidences from excitation source and vocal tract system features for Indian language identification using deep neural networks

Published in: International Journal of Speech Technology

Abstract

In this paper, a combination of excitation source information and vocal tract system information is explored for the task of language identification (LID). The excitation source information is represented by features extracted from the linear prediction (LP) residual signal, called residual cepstral coefficients (RCC). Vocal tract system information is represented by mel frequency cepstral coefficients (MFCC). To incorporate additional temporal information, shifted delta cepstra (SDC) are computed. LID systems are built using SDC over the MFCC and RCC features individually and evaluated in terms of their equal error rate (EER). Experiments have been performed on a dataset consisting of 13 Indian languages, with about 115 h for training and 30 h for testing, using a deep neural network (DNN), a DNN with attention (DNN-WA) and a state-of-the-art i-vector system. DNN-WA outperforms the baseline i-vector system. EERs of 9.93% and 6.25% are achieved using the RCC and MFCC features, respectively. By combining evidence from both features using a late fusion mechanism, an EER of 5.76% is obtained. This result indicates that the excitation source information is complementary to the widely used vocal tract system information for the task of LID.
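To make the pipeline concrete, below is a minimal Python sketch of the steps the abstract describes: inverse-filtering speech with frame-wise LP coefficients to obtain the residual, computing cepstra over both the signal (MFCC) and the residual (here, mel cepstra computed over the residual stand in for the paper's RCC), stacking shifted delta cepstra, and fusing per-language scores late. The LP order (10), the N-d-P-k = 7-1-3-7 SDC configuration, the equal fusion weights, and the librosa-based implementation are all illustrative assumptions rather than the paper's exact settings, and the random score vectors merely stand in for DNN posteriors.

```python
import numpy as np
import scipy.signal
import librosa


def lp_residual(y, order=10, frame_len=400, hop=160):
    """Inverse-filter each frame with its own LP coefficients and
    overlap-add the windowed prediction errors into a residual signal."""
    residual = np.zeros_like(y)
    win = np.hanning(frame_len)
    for start in range(0, len(y) - frame_len, hop):
        frame = y[start:start + frame_len]
        if np.max(np.abs(frame)) < 1e-8:           # skip silent frames
            continue
        a = librosa.lpc(frame, order=order)        # a[0] == 1
        e = scipy.signal.lfilter(a, [1.0], frame)  # prediction error
        residual[start:start + frame_len] += e * win
    return residual


def sdc(cepstra, N=7, d=1, P=3, k=7):
    """Shifted delta cepstra: for each frame t, stack the k deltas
    c[t + i*P + d] - c[t + i*P - d], i = 0..k-1, over N base coefficients."""
    c = cepstra[:N]
    T = c.shape[1]
    blocks = [np.roll(c, -(i * P + d), axis=1) - np.roll(c, -(i * P - d), axis=1)
              for i in range(k)]
    stacked = np.vstack(blocks)                # (N*k, T)
    return stacked[:, d:T - (k - 1) * P - d]   # drop frames that wrapped around


y, sr = librosa.load(librosa.ex("trumpet"), sr=16000)  # stand-in utterance

mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)              # vocal tract stream
rcc = librosa.feature.mfcc(y=lp_residual(y), sr=sr, n_mfcc=13)  # source stream

mfcc_sdc, rcc_sdc = sdc(mfcc), sdc(rcc)  # (49, T') each; inputs to the classifiers

# Late fusion: average the per-language scores of the two systems.
# Random probability vectors stand in for the DNN posteriors here.
n_langs = 13
rng = np.random.default_rng(0)
scores_mfcc = rng.dirichlet(np.ones(n_langs))
scores_rcc = rng.dirichlet(np.ones(n_langs))
fused = 0.5 * scores_mfcc + 0.5 * scores_rcc
print("predicted language index:", int(np.argmax(fused)))
```

The fusion step is deliberately simple: because the two feature streams describe different parts of the speech production model (source versus filter), even a plain weighted average of their per-language scores can lower the EER below either system alone, which is what the reported drop from 6.25% to 5.76% reflects.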


References

  • Al-Talabani, A., Sellahewa, H., & Jassim, S. (2013). Excitation source and low level descriptor features fusion for emotion recognition using SVM and ANN. In Proceedings of Computer Science and Electronic Engineering Conference (CEEC) (pp. 156–161). IEEE.

  • Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. In Proceedings of International Conference on Learning Representations (ICLR).

  • Bengio, Y. (2009). Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1), 1–127.

    Article  MATH  Google Scholar 

  • Dehak, N., Torres-Carrasquillo, P. A., Reynolds, D. A., & Dehak, R. (2011). Language recognition via i-vectors and dimensionality reduction. In INTERSPEECH (pp. 857–860).

  • Hinton, G., Deng, L., Yu, D., Dahl, G. E., Mohamed, Ar, Jaitly, N., et al. (2012). Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6), 82–97.

    Article  Google Scholar 

  • Lakhsmi, H. R., Achanta, S., Bhavya, P. V., & Gangashetty, S. V. (2016). An investigation of end-to-end speaker recognition using deep neural networks. International Journal of Engineering Research in Electronic and Communication Engineering, 3(1), 42–47.

    Google Scholar 

  • Leena, M., Rao, K. S., & Yegnanarayana, B. (2005). Neural network classifiers for language identification using phonotactic and prosodic features. In Proceedings of International Conference on Intelligent Sensing and Information Processing (pp. 404–408). IEEE.

  • Lopez-Moreno, I., Gonzalez-Dominguez, J., Plchot, O., Martinez, D., Gonzalez-Rodriguez, J., & Moreno, P. (2014). Automatic language identification using deep neural networks. In Proceedings of International Conference on Acoustics, Speech, and Signal Processing (ICASSP) (pp. 5337–5341). IEEE.

  • Mary, L., & Yegnanarayana, B. (2008). Extraction and representation of prosodic features for language and speaker recognition. Speech Communication, 50(10), 782–796.

    Article  Google Scholar 

  • Mounika, K. V., Achanta, S., Lakshmi, H. R., Gangashetty, S. V., & Kumar Vuppala, A. (2016). An investigation of deep neural network architectures for language recognition in Indian languages. In INTERSPEECH (pp. 2930–2933).

  • Muthusamy, Y. K., Barnard, E., & Cole, R. A. (1994). Reviewing automatic language identification. IEEE Signal Processing Magazine, 11(4), 33–41. https://doi.org/10.1109/79.317925.

    Article  Google Scholar 

  • Nandi, D., Pati, D., & Rao, K. S. (2014). Sub-segmental, segmental and supra-segmental analysis of linear prediction residual signal for language identification. In Proceedings of International Conference on Signal Processing and Communications (SPCOM) (pp. 1–6). IEEE.

  • Pati, D., & Prasana, S. R. M. (2011). Subsegmental, segmental and suprasegmental processing of linear prediction residual for speaker information. International Journal of Speech Technology, 14(1), 49–64.

    Article  Google Scholar 

  • Raffel, C., & Ellis, D. P. W. (2015). Feed-forward networks with attention can solve some long-term memory problems. http://arxiv.org/abs/1512.08756.

  • Rao, K. S., & Nandi, D. (2015). Implicit excitation source features for language identification. In Language Identification Using Excitation Source Features (pp. 31–51). New York: Springer.

  • Richardson, F., Reynolds, D., & Dehak, N. (2015). A unified deep neural network for speaker and language recognition. In INTERSPEECH (pp. 1146–1150).

  • Torres-Carrasquillo, P. A., et al. (2002). Approaches to language identification using gaussian mixture models and shifted delta cepstral features. In INTERSPEECH.

  • Torres-Carrasquillo, P. A., Singer, E., Kohler, M. A., Greene, R. J., Reynolds, D. A., & Deller, J. R., Jr. (2002). Approaches to language identification using Gaussian mixture models and shifted delta cepstral features. In INTERSPEECH.

  • Vuppala, A. K., Mounika, K. V., & Vydana, H. K. (2015). Significance of speech enhancement and sonorant regions of speech for robust language identification. In Proceedings of Signal Processing, Informatics, Communication and Energy Systems (SPICES) (pp. 1–5). IEEE.

  • Yegnanarayana, B., Prasana, S. R. M., & Rao, K. S. (2002). Speech enhancement using excitation source information. In Proceedings of International Conference on Acoustics, Speech, and Signal Processing (ICASSP) (vol. 1, pp. I–541). IEEE.

  • Zeiler, M., Ranzato, M., Monga, R., Mao, M., Yang, K., Le, Q., Nguyen, P., Senior, A., Vanhoucke, V., Dean, J., & Hinton, G. (2013). On rectified linear units for speech processing. In Proceedings of International Conference on Acoustics, Speech, and Signal Processing (ICASSP) (pp. 3517–3521). IEEE.

Download references

Acknowledgements

The authors would like to thank the Science & Engineering Research Board (SERB) for funding the project "Language Identification in Practical Environments" (YSS/2014/000933).

Author information


Corresponding author

Correspondence to Mounika Kamsali Veera.


Cite this article

Kamsali Veera, M., Vuddagiri, R., Gangashetty, S.V. et al. Combining evidences from excitation source and vocal tract system features for Indian language identification using deep neural networks. Int J Speech Technol 21, 501–508 (2018). https://doi.org/10.1007/s10772-017-9481-6
