Language, Script, and Font Recognition

Pal, Umapada; Dash, Niladri Sekhar

doi:10.1007/978-0-85729-859-1_9

The original version of this chapter was revised: the second author name which was missing has now been added. The correction to this chapter is available at https://doi.org/10.1007/978-0-85729-859-1_40

Abstract

Automatically identifying the language of a text document that contains multiple scripts and fonts is a challenging task. It depends not only on the shape, size, and style of the characters and symbols that make up the text but also on more crucial factors such as the form and size of pages, the layout of the written text, the spacing between text lines, the design of characters, the density of information, and the directionality of text composition. Successful character, script, and language recognition therefore requires an intelligent system that can handle all of these factors along with secondary ones such as language identity, writing system, ethnicity, and anthropology. Because of these complexities, identifying a script vis-à-vis a language has been a real challenge in optical character recognition (OCR) and information retrieval technology. Given the global upsurge of so-called minor and/or unknown languages, developing automatic or semiautomatic systems that can identify the language vis-à-vis the script in which a particular text document is composed has become a technological challenge. With these issues in mind, this chapter surveys some of the methods and approaches developed so far for language, script, and font recognition in written text documents. The first section presents a general overview of language and then covers the origin of language, the difficulties faced in language identification, and the existing approaches to it. The second section presents an overview of script, differentiates between single- and multiscript documents, describes script identification technologies and the challenges involved therein, focuses on the process of machine-printed script identification, and then addresses the issues involved in handwritten script identification. The third section defines font terminology, addresses the problems involved in font generation, describes font variation within a language, and discusses strategies for font and style recognition. Thus, the chapter gives a panoramic view of the three basic components involved in OCR technology (language, script, and font), covering the problems and issues involved, the milestones achieved so far, and the challenges that still lie ahead.

Author information

Authors and Affiliations

Computer Vision and Pattern Recognition Unit, Indian Statistical Institute, Kolkata, India
Umapada Pal
Linguistic Research Unit, Indian Statistical Institute, Kolkata, India
Niladri Sekhar Dash

Corresponding author

Correspondence to Umapada Pal.

Editor information

Editors and Affiliations

University of Maryland, College Park, MD, USA
David Doermann
Université de Lorraine, Nancy, France
Karl Tombre

Cite this entry

Pal, U., Dash, N.S. (2014). Language, Script, and Font Recognition. In: Doermann, D., Tombre, K. (eds) Handbook of Document Image Processing and Recognition. Springer, London. https://doi.org/10.1007/978-0-85729-859-1_9

DOI: https://doi.org/10.1007/978-0-85729-859-1_9
Published: 24 July 2019
Publisher Name: Springer, London
Print ISBN: 978-0-85729-858-4
Online ISBN: 978-0-85729-859-1
eBook Packages: Computer Science; Reference Module Computer Science and Engineering
