An optical character recognition system for printed Telugu text

  • Theoretical Advances
Pattern Analysis and Applications

Abstract

Telugu is one of the oldest and most popular languages of India, spoken by more than 66 million people, especially in South India. Little work has been reported on optical character recognition (OCR) systems for Telugu text, so it remains an area of active research. Some Telugu characters are made up of more than one connected symbol. Compound characters are written by associating modifiers with consonants, which yields hundreds of thousands of possible combinations, and a compound character may itself contain one or more connected symbols. Therefore, systems developed for documents in other scripts, such as Roman, cannot be used directly for Telugu.

The individual connected portions of a character or compound character are defined in this paper as basic symbols and treated as the unit of recognition. The algorithms exploit special characteristics of Telugu script to process document images efficiently, and they have been implemented as a Telugu OCR system for printed text (TOSP). The output of TOSP is in phonetic English, which can be transliterated to generate editable Telugu text. A special feature of TOSP is that it is designed to handle a wide range of sizes and multiple fonts while still providing a raw OCR accuracy of nearly 98%. The phonetic English representation can also be used to develop a Telugu text-to-speech system; work on this is in progress.
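The notion of a basic symbol as a connected portion of a character can be illustrated with a generic connected-component pass over a binarized image. The following is a minimal sketch and not the TOSP algorithm itself: the function name `connected_components`, the choice of 8-connectivity, the list-of-rows image format, and the toy `glyph` array are all assumptions made for illustration.

```python
from collections import deque


def connected_components(image):
    """Find 8-connected foreground regions in a binary image.

    `image` is a list of equal-length rows of 0/1 ints (1 = ink).
    Returns one inclusive bounding box (top, left, bottom, right) per
    region, ordered by the raster-scan position of its first pixel.
    This is an illustrative helper, not code from the paper.
    """
    rows = len(image)
    cols = len(image[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if image[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first flood fill from an unvisited ink pixel.
            seen[r][c] = True
            queue = deque([(r, c)])
            top, left, bottom, right = r, c, r, c
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and image[ny][nx] == 1
                                and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


# Toy example (assumed data): three disjoint ink regions, each of which
# would be a separate candidate unit of recognition.
glyph = [
    [0, 1, 1, 0, 0, 0],
    [0, 1, 1, 0, 1, 0],
    [0, 0, 0, 0, 1, 0],
    [1, 0, 0, 0, 0, 0],
]
boxes = connected_components(glyph)
```

Each bounding box delimits one connected portion. In a scheme like the one described above, such regions would be passed to a classifier individually, and the recognized symbols would then be combined into characters and compound characters.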

Author information

Corresponding author

Correspondence to C. Vasantha Lakshmi.

Appendix

Table A Confusion table

About this article

Cite this article

Lakshmi, C.V., Patvardhan, C. An optical character recognition system for printed Telugu text. Pattern Anal Applic 7, 190–204 (2004). https://doi.org/10.1007/s10044-004-0217-2

  • DOI: https://doi.org/10.1007/s10044-004-0217-2
