Language, Script, and Font Recognition

  • Reference work entry
Handbook of Document Image Processing and Recognition

Abstract

Automatic identification of the language of a text document that contains multiple scripts and fonts is a challenging task. It depends not only on the shape, size, and style of the characters and symbols that form the text but also on more crucial factors such as the form and size of pages, the layout of the written text, the spacing between text lines, the design of characters, the density of information, and the directionality of text composition. Successful handling of this information in character, script, and language recognition therefore requires an intelligent system that can deal with all these factors along with secondary ones such as language identity, writing system, ethnicity, and anthropology. Because of these complexities, identifying a script together with its language has been a real challenge in optical character recognition (OCR) and information retrieval technology. Given the global upsurge of so-called minor and/or unknown languages, developing automatic or semiautomatic systems that can identify the language and the script in which a particular text document is composed has become a technological challenge. With these issues in mind, this chapter surveys some of the methods and approaches developed so far for language, script, and font recognition in written text documents. The first section presents a general overview of language and then covers the origin of language, the difficulties faced in language identification, and the existing approaches to language identification. The second section presents an overview of script, differentiates between single- and multiscript documents, describes script identification technologies and the challenges they involve, focuses on the process of machine-printed script identification, and then addresses the issues involved in handwritten script identification. The third section defines font terminology, addresses the problems involved in font generation, refers to the phenomenon of font variation in a language, and discusses strategies for font and style recognition. The chapter thus gives a panoramic view of the three basic components involved in OCR technology: the problems and issues involved, the milestones achieved so far, and the challenges that still lie ahead.

Author information

Correspondence to Umapada Pal.

Copyright information

© 2014 Springer-Verlag London

Cite this entry

Pal, U., Dash, N.S. (2014). Language, Script, and Font Recognition. In: Doermann, D., Tombre, K. (eds) Handbook of Document Image Processing and Recognition. Springer, London. https://doi.org/10.1007/978-0-85729-859-1_9
