Identification of Indian languages using multi-level spectral and prosodic features

International Journal of Speech Technology

Abstract

This paper explores spectral and prosodic features extracted at different levels for analyzing the language-specific information present in speech. Spectral features extracted from 20 ms frames (block processing), individual pitch cycles (pitch-synchronous analysis) and glottal closure regions are used to discriminate between languages. In addition to spectral features, prosodic features extracted at the syllable, tri-syllable and multi-word (phrase) levels are proposed for capturing language-specific information. Language-specific prosody is represented by intonation, rhythm and stress features at the syllable and tri-syllable (word) levels, whereas temporal variations in fundamental frequency (F0 contour), syllable durations and temporal variations in intensity (energy contour) represent prosody at the multi-word (phrase) level. The Indian language speech database IITKGP-MLILSC is used to analyze the language-specific information in the proposed features, and Gaussian mixture models are used to capture that information. The evaluation results indicate that language identification performance improves when the features are combined. The performance of the proposed features is also analyzed on the standard Oregon Graduate Institute Multi-Language Telephone-based Speech (OGI-MLTS) database.
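
The Gaussian mixture model decision step mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes diagonal-covariance mixtures whose parameters are already trained, and it picks the language whose model gives the highest average log-likelihood over the feature vectors of an utterance. The function names, the two-dimensional toy models and the labels `lang_a` and `lang_b` are hypothetical and chosen only for illustration.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian evaluated at vector x."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def gmm_log_likelihood(x, gmm):
    """Log p(x) under a GMM given as a list of (weight, mean, var) tuples."""
    terms = [math.log(w) + log_gaussian_diag(x, m, v) for w, m, v in gmm]
    peak = max(terms)  # log-sum-exp trick for numerical stability
    return peak + math.log(sum(math.exp(t - peak) for t in terms))


def identify_language(features, models):
    """Return (best_language, scores), scoring by average log-likelihood."""
    scores = {
        lang: sum(gmm_log_likelihood(x, gmm) for x in features) / len(features)
        for lang, gmm in models.items()
    }
    return max(scores, key=scores.get), scores


# Hypothetical toy models: two mixture components per language, 2-D features.
models = {
    "lang_a": [(0.5, [0.0, 0.0], [1.0, 1.0]), (0.5, [1.0, 1.0], [1.0, 1.0])],
    "lang_b": [(0.5, [5.0, 5.0], [1.0, 1.0]), (0.5, [6.0, 6.0], [1.0, 1.0])],
}

# Hypothetical feature vectors for one test utterance.
test_features = [[0.2, -0.1], [0.9, 1.1], [0.5, 0.4]]
best, scores = identify_language(test_features, models)
print(best, scores)
```

In a real system the feature vectors would be the spectral or prosodic features described above and the mixture parameters would be estimated from training speech for each language; the toy values here only exercise the decision rule.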

Author information

Corresponding author

Correspondence to K. Sreenivasa Rao.

About this article

Cite this article

Ramu Reddy, V., Maity, S. & Sreenivasa Rao, K. Identification of Indian languages using multi-level spectral and prosodic features. Int J Speech Technol 16, 489–511 (2013). https://doi.org/10.1007/s10772-013-9198-0

  • DOI: https://doi.org/10.1007/s10772-013-9198-0
