Abstract
Telugu is one of the oldest and most popular languages of India, spoken by more than 66 million people, mainly in South India. Little work has been reported on optical character recognition (OCR) systems for Telugu text, so it remains an area of active research. Some Telugu characters are made up of more than one connected symbol. Compound characters are written by associating modifiers with consonants, which yields hundreds of thousands of possible combinations. A compound character may contain one or more connected symbols. Therefore, systems developed for documents in other scripts, such as Roman, cannot be used directly for Telugu.
In this paper, the individual connected portions of a character or compound character are defined as basic symbols and treated as the unit of recognition. The algorithms exploit special characteristics of the Telugu script to process document images efficiently, and they have been implemented as a Telugu OCR system for printed text (TOSP). The output of TOSP is in phonetic English, which can be transliterated to generate editable Telugu text. A special feature of TOSP is that it handles a large variety of sizes and multiple fonts while still providing a raw OCR accuracy of nearly 98%. The phonetic English representation can also be used to develop a Telugu text-to-speech system; work is in progress in this regard.
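The idea of a basic symbol as a connected portion of a character can be illustrated with a small connected-component pass over a binarized image. This is only a minimal sketch, not the TOSP implementation: the function name `basic_symbols`, the choice of 8-connectivity, the bounding-box output, and the toy `glyph` array are all assumptions made for illustration.

```python
from collections import deque


def basic_symbols(image):
    """Return bounding boxes (top, left, bottom, right) of the
    8-connected ink regions in a binary image (1 = ink, 0 = background).

    Hypothetical helper: 8-connectivity is an assumption of this sketch.
    """
    rows = len(image)
    cols = len(image[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if image[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first flood fill starting from an unvisited ink pixel.
            seen[r][c] = True
            queue = deque([(r, c)])
            top, left, bottom, right = r, c, r, c
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and image[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


# Toy image with two disconnected ink regions, standing in for a
# character that consists of two basic symbols.
glyph = [
    [1, 1, 0, 0, 0],
    [1, 1, 0, 1, 0],
    [0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0],
]
symbols = basic_symbols(glyph)
print(symbols)
```

Under these assumptions, each returned box would be passed on as one unit of recognition, so a compound character with several disconnected parts would produce several boxes.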
Cite this article
Lakshmi, C.V., Patvardhan, C. An optical character recognition system for printed Telugu text. Pattern Anal Applic 7, 190–204 (2004). https://doi.org/10.1007/s10044-004-0217-2
DOI: https://doi.org/10.1007/s10044-004-0217-2