Abstract
In this paper, spectral and prosodic features extracted at different levels are explored for analyzing the language-specific information present in speech. Spectral features extracted from 20 ms frames (block processing), from individual pitch cycles (pitch-synchronous analysis) and from glottal closure regions are used for discriminating between the languages. Prosodic features extracted at the syllable, tri-syllable and multi-word (phrase) levels are proposed in addition to the spectral features for capturing the language-specific information. In this study, language-specific prosody is represented by intonation, rhythm and stress features at the syllable and tri-syllable (word) levels, whereas temporal variations in fundamental frequency (F0 contour), durations of syllables and temporal variations in intensity (energy contour) are used to represent the prosody at the multi-word (phrase) level. The Indian language speech database IITKGP-MLILSC is used for analyzing the language-specific information in the proposed features, and Gaussian mixture models are used to capture this information. The evaluation results indicate that language identification performance improves when the features are combined. The performance of the proposed features is also analyzed on the standard Oregon Graduate Institute Multi-Language Telephone-based Speech (OGI-MLTS) database.
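The modelling strategy summarized above, frame-level spectral features scored by one Gaussian mixture model per language, can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes librosa and scikit-learn are available, uses plain 13-dimensional MFCCs from 20 ms frames in place of the paper's full multi-level spectral and prosodic feature set, and the file-list variables are hypothetical.

```python
# Minimal sketch: block-processing MFCCs (20 ms frames) + one GMM per language.
# Assumes librosa and scikit-learn; training/test file lists are hypothetical.
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def mfcc_frames(path, sr=8000, n_mfcc=13):
    """Per-frame MFCC vectors from ~20 ms frames with a 10 ms shift."""
    y, sr = librosa.load(path, sr=sr)
    frame = int(0.020 * sr)              # 20 ms analysis window
    hop = int(0.010 * sr)                # 10 ms frame shift
    m = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                             n_fft=frame, hop_length=hop, n_mels=26)
    return m.T                           # shape: (num_frames, n_mfcc)

def train_language_gmms(train_files, n_components=64):
    """train_files: dict mapping language name -> list of wav paths."""
    models = {}
    for lang, paths in train_files.items():
        feats = np.vstack([mfcc_frames(p) for p in paths])
        gmm = GaussianMixture(n_components=n_components,
                              covariance_type="diag", max_iter=200)
        gmm.fit(feats)
        models[lang] = gmm
    return models

def identify_language(models, test_path):
    """Choose the language whose GMM gives the highest average frame log-likelihood."""
    feats = mfcc_frames(test_path)
    scores = {lang: gmm.score(feats) for lang, gmm in models.items()}
    return max(scores, key=scores.get)
```

In the same spirit, prosodic features (F0 contour, syllable durations, energy contour) would be appended or modelled by separate GMMs and the per-stream scores combined, which is where the reported gain from feature combination comes from.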
Cite this article
Ramu Reddy, V., Maity, S. & Sreenivasa Rao, K. Identification of Indian languages using multi-level spectral and prosodic features. Int J Speech Technol 16, 489–511 (2013). https://doi.org/10.1007/s10772-013-9198-0