Abstract
A morphologically rich language has hundreds of forms of each word, which makes storing and maintaining them costly in time and resources. It also causes confusion during speech recognition, which increases the word error rate. These issues make it difficult to build speech recognition applications for such languages. Hence there is a need to develop a phonetically balanced minimal dataset. This paper describes the generation of a minimum dataset for Telugu, the second most widely spoken language in India. Minimum data generation is treated as a set covering problem, and a variety of datasets are generated based on different criteria. Among the various set covering algorithms, the greedy algorithm is chosen. The criterion used for final data selection is the frequency of occurrence of words. Because set covering requires a large pool of data from which the minimum data is selected, a 15-million-word text corpus has been created. This text corpus is analyzed thoroughly to ensure that the generated set is phonetically balanced. The generated minimum dataset consists of 21 words and covers every phoneme of the Telugu language. Telugu speech technology researchers can use this dataset to build phoneme-level speech recognition applications with less manual recording effort and time. This paper discusses the role of the minimum dataset in LVSR systems, the details of the text corpus created, and the proposed algorithm for minimum data generation.
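The set covering formulation in the abstract can be illustrated with a minimal sketch of a greedy selection loop. This is not the authors' implementation: the function name `greedy_cover`, the toy words, and their romanized phoneme labels are hypothetical, and using corpus frequency only as a tie-breaker is an assumption, since the abstract states that word frequency is the selection criterion without giving the exact rule.

```python
from typing import Dict, List, Set


def greedy_cover(word_phonemes: Dict[str, Set[str]],
                 word_freq: Dict[str, int]) -> List[str]:
    """Greedily pick words until every phoneme in the vocabulary is covered.

    word_phonemes maps each candidate word to the set of phonemes it contains.
    word_freq maps each word to its corpus frequency (assumed tie-breaker).
    """
    # The universe to cover is every phoneme that appears in some word.
    uncovered: Set[str] = set().union(*word_phonemes.values())
    chosen: List[str] = []

    while uncovered:
        # Pick the word that covers the most still-uncovered phonemes.
        # Ties go to the more frequent word, then to the later word
        # alphabetically so the result is deterministic.
        best = max(
            word_phonemes,
            key=lambda w: (len(word_phonemes[w] & uncovered),
                           word_freq.get(w, 0),
                           w),
        )
        gain = word_phonemes[best] & uncovered
        if not gain:
            break  # defensive: no word adds coverage
        chosen.append(best)
        uncovered -= gain

    return chosen


# Hypothetical toy vocabulary; labels are illustrative, not a Telugu phoneme set.
toy_phonemes = {
    "amma": {"a", "m"},
    "nanna": {"n", "a"},
    "mani": {"m", "a", "n", "i"},
    "illu": {"i", "l", "u"},
}
toy_freq = {"amma": 50, "nanna": 40, "mani": 5, "illu": 20}

selected = greedy_cover(toy_phonemes, toy_freq)
print(selected)  # ['mani', 'illu']
```

In this toy run two words are enough to cover all six phoneme labels. On a real corpus the same loop would run over the full vocabulary, and the selection key could be changed to weight frequency more heavily than coverage gain.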
Cite this article
Sunitha, K.V.N., Sharada, A. Minimum data generation for Telugu speech recognition. Int J Speech Technol 18, 217–230 (2015). https://doi.org/10.1007/s10772-014-9262-4
DOI: https://doi.org/10.1007/s10772-014-9262-4