Shape decomposition-based handwritten compound character recognition for Bangla OCR

https://doi.org/10.1016/j.jvcir.2017.11.016

Abstract

Proper recognition of complex-shaped handwritten compound characters is still a major challenge for Bangla OCR systems. In this paper, we propose a novel shape decomposition-based segmentation technique to decompose compound characters into prominent shape components. This shape decomposition reduces the classification complexity, since fewer classes need to be recognized, and at the same time improves the recognition accuracy. The decomposition is done at the segmentation area where the two basic shapes are joined to form a compound character. We use a chain code histogram feature set with a multi-layer perceptron (MLP) based classifier trained with backpropagation for classification. In our experiments, the proposed method is observed to provide good recognition accuracy compared with other existing methods.
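
As an illustration of the chain code histogram feature named above, the following is a minimal sketch, not the authors' implementation. It assumes the contour is already available as an ordered list of 8-connected pixel coordinates, and it assumes the common Freeman convention in which code 0 points east and codes increase counter-clockwise. The function names `chain_code` and `chain_code_histogram` are hypothetical, and the exact feature layout used in the paper is not given in this excerpt.

```python
from collections import Counter

# Freeman 8-direction codes: list index = code, value = (dx, dy) step.
# Assumed convention: 0 = east, codes increase counter-clockwise, y points up.
DIRECTIONS = [(1, 0), (1, 1), (0, 1), (-1, 1), (-1, 0), (-1, -1), (0, -1), (1, -1)]
STEP_TO_CODE = {step: code for code, step in enumerate(DIRECTIONS)}


def chain_code(points):
    """Return the Freeman chain code of an ordered, 8-connected point sequence."""
    codes = []
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        step = (x1 - x0, y1 - y0)
        if step not in STEP_TO_CODE:
            raise ValueError(f"{(x0, y0)} and {(x1, y1)} are not 8-connected neighbours")
        codes.append(STEP_TO_CODE[step])
    return codes


def chain_code_histogram(points):
    """Return the 8-bin histogram of chain codes, normalized to sum to 1."""
    codes = chain_code(points)
    if not codes:
        return [0.0] * 8
    counts = Counter(codes)
    return [counts[code] / len(codes) for code in range(8)]
```

Normalizing by the number of steps makes the histogram independent of contour length, which is one common way to compare shapes of different sizes. Whether the paper normalizes this way is an assumption here.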

Introduction

Recognition of offline handwritten characters is an ongoing topic of research. Several works have been published on recognition of Roman [1], [2], Arabic [3], [4], [5], Chinese [6], [7], [8], and Japanese [9], [10] scripts, but only a handful of studies have been done on recognition of printed as well as handwritten characters in Bangla script [11], [12], [13], [14], [15], [16]. The presence of complex-shaped compound (also known as conjunct) characters, together with their cursive nature, makes the recognition problem in Bangla script much more difficult than in other scripts [17]. The number of studies decreases further when recognition of compound characters in Bangla script is considered. Garain and Chaudhuri [18] have used a feature- and run-based normalized template matching technique for the recognition of printed Bangla compound characters. Pal et al. [19] have proposed a method to recognize 138 classes of compound characters by using a modified quadratic discriminant function (MQDF), with directional information acquired from the arc tangent of the gradient used as features. Das et al. [20] have introduced a technique that uses a multi-layer perceptron (MLP) based classifier and a quadtree-based longest run feature to recognize 55 handwritten compound character classes that cover 90% of the total compound character usage. Later, they [21] extended their work by proposing an MLP and support vector machine (SVM) based strategy that uses shadow and quadtree-based longest run features to recognize 93 character classes, including 50 basic and 43 compound handwritten Bangla characters. Das et al. [22] have developed a combined genetic algorithm (GA) and SVM based multi-stage classification for handwritten Bangla compound characters. In this method, SVM performs the first stage of classification, followed by GA, which handles the classes that were misclassified by SVM. A group of feature sets, such as shadow, octant centroid, quadtree-based longest run, and different topological attributes, is used to form the overall feature set for recognition. Bag et al. [23] have proposed a method that decomposes the compound characters into skeletal segments to improve recognition accuracy. In this method, convex shape primitives are extracted to form the structural feature set, and a template matching scheme is used to recognize the handwritten Bangla compound characters.

We have noticed that only a very limited number of works on recognition of complex-shaped Bangla compound characters exist in the literature so far. Nowadays, researchers are focusing on structural feature sets to handle the complex structural shape of a compound character. But the extraction of proper structural features from a complex-shaped compound character is itself a big challenge. If the structural complexity of the compound characters can be reduced during the preprocessing stages, then better recognition accuracy can be expected for such characters. Our work is based on the premise that classification and recognition of two basic characters instead of a complex-shaped compound character would not only provide better recognition accuracy, but also reduce the classification complexity as the number of classes diminishes. This is significant because classes for compound characters no longer need to be defined separately. To the best of our knowledge, no work has been published on this aspect for Bangla handwritten complex-shaped compound characters. In this paper, we propose a shape decomposition-based strategy to segment the compound characters into basic shapes, which provides better classification and recognition accuracy and at the same time reduces classification complexity.
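
The premise above can be sketched as follows: once a compound character has been split into two shapes, each shape is classified against the basic classes only, and the compound is reported as the pair of basic labels. This is a hedged illustration and not the paper's classifier. The function `recognize_compound`, the score vectors, and the placeholder label names are all hypothetical, and the sketch assumes that both decomposed shapes fall within the basic class set.

```python
def recognize_compound(first_scores, second_scores, basic_labels):
    """Label each of the two decomposed shapes with its best-scoring basic class.

    first_scores and second_scores are per-class scores (one entry per basic
    class) produced by any classifier; the scores themselves are assumed inputs.
    """
    def best_label(scores):
        if len(scores) != len(basic_labels):
            raise ValueError("expected one score per basic class")
        best_index = max(range(len(scores)), key=scores.__getitem__)
        return basic_labels[best_index]

    return best_label(first_scores), best_label(second_scores)


# Placeholder basic classes; a real system would use the basic character set.
labels = ["ka", "ta", "ra"]
pair = recognize_compound([0.1, 0.7, 0.2], [0.6, 0.3, 0.1], labels)
```

With this arrangement the classifier's output layer has one unit per basic class and no dedicated compound classes. Taking the figures reported for [21] as an example, a direct classifier over 50 basic and 43 compound characters needs 93 classes, whereas a decomposition-based one would need only the 50 basic classes, provided every compound decomposes into shapes from that set.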

The remainder of the paper is organized as follows. We discuss the characteristics of the Bangla language in Section 2, followed by our proposed methodology in Section 3. The experimental results are discussed in Section 4. Finally, a brief conclusion is presented in Section 5.

Characteristics of Bangla language

The Constitution of India registers 22 languages, namely Assamese, Bangla, Bodo, Dogri, Gujarati, Hindi, Kannada, Kashmiri, Konkani, Maithili, Malayalam, Manipuri, Marathi, Nepali, Oriya, Punjabi, Sanskrit, Santhali, Sindhi, Tamil, Telugu, and Urdu. These languages are written in 13 different scripts with over 720 dialects. The Bangla script is used to write the Bangla, Assamese, and Manipuri languages. The importance of the Bangla language can be acknowledged from the fact that it is the seventh most spoken

Proposed methodology

We propose a shape decomposition-based strategy to segment the compound characters into prominent basic shapes for better classification and recognition accuracy. The proposed methodology is divided into four parts: preprocessing and segmentation area detection, group formation and shape decomposition, feature extraction, and classification. We perform sequential steps to identify the segmentation area where the two basic shapes are amalgamated to form a compound character. Once the area is

Dataset

We have used the ICDAR 2013 Segmentation Dataset [31] and the CMATERdb [32] dataset (version 3.1.3.1 for compound characters and version 3.1.2 for basic characters) for our experiments. Both of these datasets are unbiased with varied elements and are used by most researchers working in this particular domain. A total of 10,240 sample images of compound characters are used to carry out the experiment. We have used 12,300 basic character images for training. All modules used for

Conclusion

Recognition of cursive Bangla handwritten characters has always been a challenge for researchers. The presence of compound characters makes the task much more difficult. Reducing the structural complexity of these compound characters would make recognition more accurate. In the current work, we have developed a shape decomposition-based methodology to segment the complex-shaped compound characters into two basic, simple, and prominent shapes. Our achievement is twofold. We not only achieve

References (32)

  • Y. Sobu, H. Goto, H. Aso, Binary tree-based precision-keeping clustering for very fast Japanese character recognition,...
  • X.D. Zhou et al., Handwritten Chinese/Japanese text recognition using semi-Markov conditional random fields, IEEE Trans. Pattern Anal. Mach. Intell. (2013)
  • R. Sarkar, N. Das, S. Basu, M. Kundu, M. Nasipuri, D.K. Basu, A two-stage approach for segmentation of handwritten...
  • S. Bag, P. Bhowmick, G. Harit, A. Biswas, Character segmentation of handwritten Bangla text by vertex characterization...
  • S. Mandal, S. Sur, A. Dan, P. Bhowmick, Handwritten Bangla character recognition in machine-printed forms using...
  • S. Bag, P. Bhowmick, G. Harit, Recognition of Bengali handwritten characters using skeletal convexity and dynamic...

This paper has been recommended for acceptance by Zicheng Liu.