Abstract
In this paper, spectral and prosodic features extracted at different levels are explored for analyzing the language-specific information present in speech. Spectral features extracted from 20 ms frames (block processing), from individual pitch cycles (pitch-synchronous analysis) and from glottal closure regions are used for discriminating between the languages. Prosodic features extracted at the syllable, tri-syllable and multi-word (phrase) levels are proposed in addition to the spectral features for capturing the language-specific information. In this study, language-specific prosody is represented by intonation, rhythm and stress features at the syllable and tri-syllable (word) levels, whereas temporal variations in fundamental frequency (F0 contour), durations of syllables and temporal variations in intensity (energy contour) are used to represent the prosody at the multi-word (phrase) level. The Indian language speech database IITKGP-MLILSC is used for analyzing the language-specific information in the proposed features, and Gaussian mixture models are used to capture this information. The evaluation results indicate that language identification performance improves when the features are combined. The performance of the proposed features is also analyzed on the standard Oregon Graduate Institute Multi-Language Telephone-based Speech (OGI-MLTS) database.
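The modelling strategy summarized above, frame-level spectral features scored by one Gaussian mixture model per language, can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes librosa and scikit-learn are available, uses plain 13-dimensional MFCCs from 20 ms frames in place of the paper's full multi-level spectral and prosodic feature set, and the file-list variables are hypothetical.

```python
# Minimal sketch: block-processing MFCCs (20 ms frames) + one GMM per language.
# Assumes librosa and scikit-learn; training/test file lists are hypothetical.
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def mfcc_frames(path, sr=8000, n_mfcc=13):
    """Per-frame MFCC vectors from ~20 ms frames with a 10 ms shift."""
    y, sr = librosa.load(path, sr=sr)
    frame = int(0.020 * sr)              # 20 ms analysis window
    hop = int(0.010 * sr)                # 10 ms frame shift
    m = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                             n_fft=frame, hop_length=hop, n_mels=26)
    return m.T                           # shape: (num_frames, n_mfcc)

def train_language_gmms(train_files, n_components=64):
    """train_files: dict mapping language name -> list of wav paths."""
    models = {}
    for lang, paths in train_files.items():
        feats = np.vstack([mfcc_frames(p) for p in paths])
        gmm = GaussianMixture(n_components=n_components,
                              covariance_type="diag", max_iter=200)
        gmm.fit(feats)
        models[lang] = gmm
    return models

def identify_language(models, test_path):
    """Choose the language whose GMM gives the highest average frame log-likelihood."""
    feats = mfcc_frames(test_path)
    scores = {lang: gmm.score(feats) for lang, gmm in models.items()}
    return max(scores, key=scores.get)
```

In the same spirit, prosodic features (F0 contour, syllable durations, energy contour) would be appended or modelled by separate GMMs and the per-stream scores combined, which is where the reported gain from feature combination comes from.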
Cite this article
Ramu Reddy, V., Maity, S. & Sreenivasa Rao, K. Identification of Indian languages using multi-level spectral and prosodic features. Int J Speech Technol 16, 489–511 (2013). https://doi.org/10.1007/s10772-013-9198-0