Elsevier

Computer Science Review

Volumes 15–16, February–May 2015, Pages 1-28

Survey
Offline Script Identification from multilingual Indic-script documents: A state-of-the-art

https://doi.org/10.1016/j.cosrev.2014.12.001

Abstract

Offline Script Identification (OSI) facilitates many important applications, such as automatic archiving of multilingual documents, searching online/offline archives of document images, and selecting a script-specific Optical Character Recognition (OCR) engine in a multilingual environment. In a multilingual country like India, a document containing text words in more than one language is a common scenario. A state-of-the-art survey of the techniques available for OSI of Indic scripts would be of great aid to researchers. Hence, this article attempts to discuss the advancements reported in the literature during the last few decades. Various feature extraction and classification techniques associated with the OSI of Indic scripts are discussed in this survey. We hope that this survey will serve as a compendium not only for researchers in India, but also for policymakers and practitioners there. It should also help to bring together researchers working on different Indic scripts. Taking the recent developments in OSI of Indian regional scripts into consideration, this article aims to provide a better platform for future research activities.

Introduction

With the advancement in computer technology and the availability of low-cost, high-capacity storage devices, storing documents in electronic form has become a common practice. A document, whether handwritten or printed, may contain writing in different scripts, as well as graphics and images. For example, museum archives contain old, fragile documents of scientific, historical or artistic value, written in different scripts with many graphic illustrations. OCR is a type of software designed to translate images of text into machine-editable text. However, most OCR systems are script-specific in the sense that they can read characters written in one particular script only. A script is defined as the graphic form of a writing system. A script class refers to a particular style of writing and the set of characters used in it. Languages throughout the world are typeset in many different scripts. A script may be used by only one language or may be shared by many languages, sometimes with slight variations from one language to another. For example, Devnagari script is used for writing a number of Indic languages like Sanskrit, Hindi, Konkani, Marathi, Nepali, etc. It is perhaps impossible to design a single recognizer which can identify a large number of scripts/languages. So, for a multilingual text document, identifying the different scripts and extracting the portions written in the same script is a pressing need, so that script-specific OCR systems can be employed. However, manual identification of the scripts in a mixed-script document may be too tedious and impractical. So, automatic OSI techniques are necessary to identify the scripts in a given input document, whose portions can then be sent to their corresponding OCR engines. Fig. 1 shows some examples of mixed-script documents.

Script identification is a key step in document image analysis especially when the environment is multi-script and multilingual. It also serves as an essential precursor for recognizing the language in which a document is written. This is necessary for further processing of the document, such as searching, indexing or translation. For scripts used by only one language, script identification itself accomplishes language identification. For scripts shared by many languages, script recognition acts as the first level of classification followed by language identification within the script.

India is a highly multilingual country with 22 constitutionally recognized languages. Besides these, hundreds of other languages are used in India, each one with a number of dialects. The officially recognized languages are Hindi, Bengali, Punjabi, Marathi, Gujarati, Oriya, Sindhi, Assamese, Nepali, Urdu, Sanskrit, Tamil, Telugu, Kannada, Malayalam, Kashmiri, Manipuri, Konkani, Maithili, Santhali, Bodo, and Dogri. Hindi, written in Devnagari script, is India’s official language and has the most speakers, estimated to be more than 500 million. Indic scripts are a logical composition of individual script symbols and follow a common logical structure. This can be referred to as the “script composition grammar”, which has no counterpart in any other set of scripts in the world. Indic scripts are written syllabically and are usually visually composed in three tiers, where the constituent symbols in each tier play specific roles in the interpretation of that syllable [1].

Automatic script identification in a multilingual environment has been a challenging research problem over the last two decades [2]. Researchers have investigated OCR for a number of Indic scripts. However, most of this research has been confined to the identification of isolated characters rather than the script. Unlike Roman script, in which characters are simply placed one after another, the Indic scripts are a composition of the constituent symbols in two dimensions. This implies that researchers first segment an Indic script word into its composite characters, and then each composite character is decomposed into the constituent symbols or strokes that are finally recognized. Fig. 2 shows the general block diagram of the script identification system.

The identification of printed or handwritten script involves preprocessing and segmentation, feature extraction, and recognition (classification). Preprocessing enhances the quality of a captured image so that later stages can interpret it more reliably, and the choice of preprocessing method depends on the application for which the image is used. Noise is introduced into a document image during its acquisition and/or transmission over wired or wireless channels, as well as by changes in some parameters of the acquisition system used for OCR. Skew is an unavoidable distortion that is often introduced during scanning or copying of a document. Many general-purpose preprocessing techniques are available; however, several experiments on script identification suggest that preprocessing methods have to be customized to suit the requirements of script identification. Preprocessing techniques generally include noise removal, scaling, binarization, skew and slant correction, header line removal, etc. Binarization is the process of converting a 256-level gray scale image into a two-level (black and white) image. To binarize a gray scale image, one or more threshold values must first be determined. If a pixel value is less than the threshold value, the corresponding pixel of the output image is set to 1 (black); otherwise it is set to 0 (white). The morphological opening and closing operators [3] are frequently used to remove noise from document images and to connect discontinuities introduced during the thresholding stage. The opening and closing operators are defined as follows: $A \circ B = (A \ominus B) \oplus B$ and $A \bullet B = (A \oplus B) \ominus B$, where $\ominus$ and $\oplus$ are the morphological erosion and dilation operators, respectively, and $B$ is the structuring element.
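The binarization rule and the opening and closing operators described above can be sketched in a few lines of dependency-free Python. This is only an illustrative sketch, not the implementation used in any surveyed work: the function names, the square structuring element of side `size`, and the choice to treat pixels outside the image as background are all assumptions made here.

```python
def binarize(gray, threshold):
    """Set pixels darker than `threshold` to 1 (black/ink), all others to 0 (white)."""
    return [[1 if px < threshold else 0 for px in row] for row in gray]


def _morph(img, size, combine):
    """Slide a size x size square window over a binary image.

    combine=all gives erosion, combine=any gives dilation.
    Assumption: pixels outside the image count as 0 (background).
    """
    h, w = len(img), len(img[0])
    r = size // 2
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = [
                img[yy][xx] if 0 <= yy < h and 0 <= xx < w else 0
                for yy in range(y - r, y + r + 1)
                for xx in range(x - r, x + r + 1)
            ]
            out[y][x] = 1 if combine(window) else 0
    return out


def erode(img, size=3):
    return _morph(img, size, all)


def dilate(img, size=3):
    return _morph(img, size, any)


def opening(img, size=3):
    """Erosion followed by dilation: removes specks smaller than the window."""
    return dilate(erode(img, size), size)


def closing(img, size=3):
    """Dilation followed by erosion: bridges gaps narrower than the window."""
    return erode(dilate(img, size), size)
```

With a 3 x 3 window, `opening` deletes isolated ink pixels while restoring larger strokes, and `closing` fills one-pixel breaks in a stroke. Because of the background-padding assumption, strokes touching the image border are eroded there, so a small margin around the text is advisable.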

Next, the text lines and then the words are segmented from the document images. For printed Indic scripts, it has been found that the horizontal and vertical projection profiles are useful for segmenting text lines and words, respectively, from document images. The horizontal projection profile method computes the sum of all black pixels in every row and constructs the corresponding horizontal pixel density histogram. Based on the peak/valley points of the histogram, individual text lines are separated as shown in Fig. 3(a). A row in which the density of data pixels is zero denotes a boundary between two consecutive lines. After text line segmentation, each text line is scanned vertically for word segmentation as shown in Fig. 3(b). For the vertical projection profile, the number of data pixels in each column is calculated to construct a vertical pixel density histogram. By analyzing the vertical pixel density histogram of a text line, words are easily separated at the columns whose data pixel density equals zero.
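The projection-profile segmentation just described can be illustrated with a short Python sketch operating on a binary image (1 = data pixel, 0 = background). The function names and the `min_gap` parameter are assumptions introduced here for illustration; in particular, `min_gap` is a hypothetical knob for ignoring narrow inter-character gaps and is not part of the description above.

```python
def horizontal_profile(img):
    """Number of data pixels in each row."""
    return [sum(row) for row in img]


def vertical_profile(img):
    """Number of data pixels in each column."""
    return [sum(col) for col in zip(*img)]


def nonzero_runs(profile):
    """Return (start, end) pairs, end exclusive, of maximal runs of nonzero values."""
    runs, start = [], None
    for i, value in enumerate(profile):
        if value > 0 and start is None:
            start = i
        elif value == 0 and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(profile)))
    return runs


def segment_lines(img):
    """Text lines are the row ranges between zero-density rows."""
    return nonzero_runs(horizontal_profile(img))


def segment_words(img, line, min_gap=1):
    """Words are the column ranges of one text line between zero-density columns.

    Runs separated by fewer than `min_gap` empty columns are merged
    (assumed heuristic; min_gap=1 splits at every empty column).
    """
    top, bottom = line
    merged = []
    for start, end in nonzero_runs(vertical_profile(img[top:bottom])):
        if merged and start - merged[-1][1] < min_gap:
            merged[-1] = (merged[-1][0], end)
        else:
            merged.append((start, end))
    return merged
```

A typical use is to call `segment_lines` once on the page and then `segment_words` on each returned row range. This simple scheme relies on zero-density rows and columns, so it presumes a deskewed, reasonably noise-free page.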

Feature extraction is a vital part of any recognition system. The main intention behind feature extraction is to describe a pattern by means of a minimal number of attributes. One significant task in the design of a pattern recognition system is to develop an algorithm that extracts characteristics or features of a pattern from the initial measurements. Feature selection is the process of minimizing the number of features while maximizing the discriminating power of the feature set. It aims to identify an optimal subset of relevant features from the large number of features collected from the patterns in the dataset, such that the overall classification accuracy is increased. Recognition or classification of a particular pattern depends on the selection of the features and of a classifier that can assign the pattern to its class. The classifier is sometimes called the ‘heart’ of the pattern recognition system. It takes a feature vector and assigns to it a decision, namely the label of the pattern class that the recognizer has chosen.
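To make the role of the classifier concrete, the following is a minimal sketch of one very simple classifier, a nearest-centroid rule, which maps a feature vector to a script label. It is chosen here purely for illustration and is not a method attributed to the surveyed works; the function names and the two-dimensional feature vectors in the usage note are hypothetical.

```python
from math import dist


def train_centroids(samples):
    """Compute the mean feature vector of each class.

    `samples` is an iterable of (feature_vector, label) pairs.
    """
    sums, counts = {}, {}
    for vec, label in samples:
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, value in enumerate(vec):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def classify(centroids, vec):
    """Return the label whose centroid is closest (Euclidean distance) to `vec`."""
    return min(centroids, key=lambda label: dist(centroids[label], vec))
```

For example, after `centroids = train_centroids([([0.9, 0.1], "Devnagari"), ([0.1, 0.8], "Roman")])`, calling `classify(centroids, [0.8, 0.2])` returns the label of the nearer class mean. The feature values are placeholders; in practice they would come from the feature extraction stage.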

The literature shows that only a few researchers have undertaken a review of script identification techniques. The reviews described in [4], [5], [6], [7] are worth mentioning. U. Pal et al. [4] presented a review of OCR systems for Indic language scripts. A survey of script identification techniques for multi-script document images is given in [5]. A report on the key technologies in OCR of handwritten Devnagari script can be found in [6]. A review of offline handwritten script identification techniques is given in [7].

All existing works on automatic script/language identification are broadly classified into either the local approach or the global approach [7]. In the local approach, the features are extracted from a list of connected components such as lines, words and characters, which are obtained only after segmenting the underlying document image. So, the success rate of classification depends on the effectiveness of the preprocessing steps. However, it is difficult to find a common segmentation method that suits all script classes. Due to this limitation, local approaches cannot serve as a generalized scheme. In contrast, global approaches analyze regions comprising at least two text lines, and hence fine segmentation of the underlying document into lines, words and characters is not necessary. Consequently, the script classification task is simpler and faster with the global approach than with the local approach. In this paper, we present a comprehensive survey of different script identification techniques developed mainly for identification of the major scripts of India.

The organization of the survey is as follows: Section 2 covers the properties and evolution of Indic scripts, whereas Section 3 discusses the methods used for OSI of Indic scripts. Sections 4 (Structure-based script identification) and 5 (Visual appearance-based script identification) describe the different script identification techniques related to OSI. Section 6 presents a comparative analysis of some of the benchmark work for OSI of Indic scripts. Section 7 provides a snapshot of the benchmark databases available for researchers in this domain. Section 8 describes the scope of future work, followed by some of the existing difficulties in this domain, whereas Section 9 concludes the review.

Section snippets

Properties of Indic scripts

Scripts symbolize the writing systems employed by the languages to represent the sounds which form the phonetic base of the languages. In the Indian subcontinent, besides Roman script, 12 major modern scripts are currently being used: Devnagari, Bangla, Oriya, Gujarati, Gurumukhi, Tamil, Telugu, Kannada, Malayalam, Manipuri, Sinhala and Urdu. Of these, Urdu is derived from the Persian script and is written from right to left. The other 11 scripts, written from left to right, originated from the

Script identification methods and work on Indic scripts

Script identification relies on the fact that each script has unique spatial distribution and visual attributes that make it possible to distinguish it from other scripts. So, the basic task involved in script recognition is to devise a technique to discover these features/attributes from a given document and then classify the document’s script accordingly. Based on the nature of approach and features used, these methods may be divided into two broad categories—structure-based and visual

Structure-based script identification

In general, script classes differ from each other in their stroke structure and connections, and the writing styles associated with the character sets they use. One approach to script recognition may be to extract connected components (continuous runs of object pixels) in a document and then analyze the shapes and structures of the script used in the document. In Indic scripts like Devnagari, Bangla, Urdu, etc., a word or a part of a word forms a connected component. Based on the granularity of

Visual appearance-based script identification

Script types generally differ from each other due to the shape of individual characters, and the way they are grouped into words, words into text lines, etc. This gives distinctively different visual appearances for different scripts. Texture could be defined in simple form as “repetitive occurrence of the same pattern”. Another definition of texture claims that, “an image region has a constant texture if a set of its local properties in that region is constant, slowly changing or approximately

Comparative analysis of the proposed work

It is evident from the state of the art that various features have been used by different researchers for script recognition. However, the results they report, although quite encouraging on most occasions, were obtained using only a selected number of script classes in their experiments. This leaves open the question of how these script features will perform when applied to scripts other than those considered in their works. Therefore, it is important to investigate the

Databases available for OSI

In recent years, efforts to create datasets for Indic languages have been reported in the literature, and dataset creation for Indic scripts has received considerable attention over the last few decades. Earlier research on Indic script recognition systems was reported on the basis of databases collected in the laboratory, and most researchers tested their algorithms on artificially crafted datasets. However, some of them have taken up the challenge and prepared benchmark databases for several Indic languages. The summary

Scope of future work

Performance of the works reported above for Indic script OSI may be improved if the following issues are given proper attention. Some of them are discussed below:

Conclusion

This paper presents a comprehensive survey of the developments in Indic script recognition techniques, which is an important issue in OCR research in a multilingual/multi-script world. Researchers have attempted to characterize different scripts either by extracting their structural features or by deriving some visual attributes. Accordingly, many different script features have been proposed over the years for OSI. Script identification, in general, done either in text line-wise or word-wise are

Acknowledgments

Authors are grateful to the Center for Microprocessor Application for Training Education and Research (CMATER) and Project on Storage Retrieval and Understanding of Video for Multimedia (SRUVM) of Computer Science & Engineering Department, Jadavpur University, for providing infrastructure facilities during progress of the work. The current work, reported here, has been partially funded by University with Potential for Excellence (UPE), Phase-II, UGC, Government of India.

References (97)

  • H. Scharfe, Kharosti and Brahmi, J. Am. Oriental Soc. (2002)
  • R.C. Gonzalez et al., Digital Image Processing (1992)
  • S. Abirami et al., A survey of script identification techniques for multi-script document images, Int. J. Recent Trends Eng. (2009)
  • A.S. Ramteke et al., A survey of offline recognition of handwritten Devanagari script, Int. J. Sci. Eng. Res. (2012)
  • D.S. Guru et al., A review of offline handwritten script identification, National Conference on Advanced Computing and Communications, NCACC, April 2012, Int. J. Comput. Appl. (2012)
  • A.S. Mahmud, Crisis and Need: information and communication technology in development initiatives runs through a...
  • B.V. Dhandra, P. Nagabhushan, M. Hangarge, R. Hegadi, V.S. Malemath, Script identification based on morphological...
  • L. Vincent, Morphological gray scale reconstruction in image analysis: applications and efficient algorithms, IEEE Trans. Image Process. (1993)
  • K. Roy, S.K. Das, Sk.Md. Obaidullah, Script identification from handwritten documents, in: Proc. of 3rd National...
  • B.B. Mandelbrot, The Fractal Geometry of Nature Freeman, NY,...
  • Sk.Md. Obaidullah et al., A system for handwritten script identification from Indian document, J. Pattern Recognit. Res. (2013)
  • G.S. Rao et al., Script identification of Telugu, English and Hindi document image, Int. J. Adv. Eng. Global Technol. (2014)
  • R. Gopakumar et al., Script identification from multilingual Indian documents using structural features, J. Comput. (2010)
  • M.C. Padma et al., Identification of Telugu, Devnagari and English scripts using discriminating features, Int. J. Comput. Sci. Inf. Technol. (IJCSIT) (2009)
  • U. Pal, S. Sinha, B.B. Chaudhuri, Multi-script line identification from Indian documents, in: Proc. of 7th...
  • B. Kumar et al., Line based robust script identification for Indian languages, Int. J. Inf. Electron. Eng. (2012)
  • S. Mohanty et al., A novel approach for Bilingual (English-Oriya) script identification and recognition in a printed document, Int. J. Image Process. (IJIP) (2010)
  • P.K. Aithal, G. Rajesh, D.U. Acharya, M. Krishnamoorthi, N.V. Subbareddy, Text line script identification for a...
  • P.K. Aithal et al., Multi-script line identification system for Indian languages, J. Comput. (2010)
  • M. Hangarge et al., Offline handwritten script identification in document images, Int. J. Comput. Appl. (IJCA) (2010)
  • M.C. Padma et al., Script identification from trilingual documents using profile based features, Int. J. Comput. Sci. Appl. (IJCSA) (2010)
  • U. Pal, B.B. Chaudhuri, Script line separation from Indian multi-script documents, in: Proc. of 5th International...
  • U. Pal, B.B. Choudhuri, Automatic separation of words in multi lingual multi script Indian documents, in: Proc. of 4th...
  • S. Sinha et al., Word-wise script identification from Indian documents
  • S. Chanda, S. Pal, U. Pal, Word-wise Sinhala, Tamil and English script identification using Gaussian kernel SVM, in:...
  • S. Chanda, S. Pal, K. Franke, U. Pal, Two-stage approach for word-wise script identification, in: Proc. of 10th...
  • U. Pal, N. Sharma, T. Wakabayashi, F. Kimura, Handwritten numeral recognition of six popular Indian scripts, in: Proc....
  • S.K. Sangame et al., Script identification of text words from a bilingual document using voting techniques, World J. Sci. Technol. (2012)
  • E. Hassan, R. Garg, S. Chaudhury, M. Gopal, Script based text identification: a multi-level architecture, in: Proc. of...
  • P. Viola et al., Robust real-time face detection, Int. J. Comput. Vis. (2004)
  • E. Hassan, S. Chaudhury, M. Gopal, Document image retrieval using feature combination in kernel space, in: Proc. of 20th...
  • B.V. Dhandra, H. Mallikarjun, R. Hegadi, V.S. Malemath, Word-wise script identification from Bilingual documents based...
  • A. Kumar et al., Discrimination of English to other Indian languages (Kannada and Hindi) for OCR system, Int. J. Comput. Sci. Eng. Appl. (IJCSEA) (2012)
  • O. Prakash et al., An efficient approach for script identification, Int. J. Comput. Trends Technol. (IJCTT) (2013)
  • N. Vishwanath et al., Classification of scripts using vertical stroke feature, Int. J. Eng. Res. Appl. (IJERA) (2012)
  • M.C. Padma et al., Script identification of text words from a tri lingual document using voting technique, Int. J. Image Process. (2010)
  • R. Sarkar et al., Word level script identification from Bangla and Devnagari handwritten texts mixed with Roman scripts, J. Comput. (2010)
  • A. Khandelwal, P. Choudhury, R. Sarkar, S. Basu, M. Nasipuri, N. Das, Text line segmentation for unconstrained...