Abstract
Automatic identification of the language of a text document that contains multiple scripts and fonts is a challenging task. It depends not only on the shape, size, and style of the characters and symbols that make up the text but also on factors such as page form and size, text layout, spacing between text lines, character design, information density, and the directionality of text composition. Handling these varied types of information during character, script, and language recognition therefore requires an intelligent system that can deal with all of them, along with secondary factors such as language identity, writing system, ethnicity, and anthropology. Because of these complexities, identification of script in relation to language has been a real challenge in optical character recognition (OCR) and information retrieval technology. With the global upsurge of so-called minor and/or unknown languages, it has become a technological challenge to develop automatic or semiautomatic systems that can identify the language, in relation to the script, in which a given text document is composed. With these issues in mind, this chapter reviews some of the methods and approaches developed so far for language, script, and font recognition in written text documents. The first section, after a general overview of language, discusses the origin of language, the difficulties faced in language identification, and the existing approaches to language identification. The second section presents an overview of script, differentiates between single- and multiscript documents, describes script identification technologies and the challenges they involve, focuses on machine-printed script identification, and then addresses the issues involved in handwritten script identification. The third section defines font terminology, addresses the problems involved in font generation, describes font variation within a language, and discusses strategies for font and style recognition. The chapter thus gives a panoramic view of the three basic components involved in OCR technology: the problems and issues involved, the milestones achieved so far, and the challenges that still lie ahead.
© 2014 Springer-Verlag London
About this entry
Cite this entry
Pal, U., Dash, N.S. (2014). Language, Script, and Font Recognition. In: Doermann, D., Tombre, K. (eds) Handbook of Document Image Processing and Recognition. Springer, London. https://doi.org/10.1007/978-0-85729-859-1_9
DOI: https://doi.org/10.1007/978-0-85729-859-1_9
Publisher Name: Springer, London
Print ISBN: 978-0-85729-858-4
Online ISBN: 978-0-85729-859-1
eBook Packages: Computer Science, Reference Module Computer Science and Engineering