Shape decomposition-based handwritten compound character recognition for Bangla OCR
Introduction
Recognition of offline handwritten characters is an ongoing topic of research. Several works have been published on the recognition of Roman [1], [2], Arabic [3], [4], [5], Chinese [6], [7], [8], and Japanese [9], [10] scripts, but only a handful of studies address the recognition of printed or handwritten characters in Bangla script [11], [12], [13], [14], [15], [16]. The presence of complex-shaped compound (also known as conjunct) characters, together with the cursive nature of Bangla script, makes the recognition problem much more difficult than for other scripts [17]. The number of studies decreases further when only the recognition of compound characters in Bangla script is considered. Garain and Chaudhuri [18] used a feature- and run-based normalized template matching technique for the recognition of printed Bangla compound characters. Pal et al. [19] proposed a method to recognize 138 classes of compound characters using a modified quadratic discriminant function (MQDF), with directional information acquired from the arc tangent of the gradient used as features. Das et al. [20] introduced a technique that uses a multi-layer perceptron (MLP) based classifier and a quadtree-based longest run feature to recognize 55 handwritten compound character classes that cover 90% of the total compound character usage. Later, they [21] extended their work by proposing an MLP and support vector machine (SVM) based strategy that uses shadow and quadtree-based longest run features to recognize 93 character classes, comprising 50 basic and 43 compound handwritten Bangla characters. Das et al. [22] developed a multi-stage classification scheme for handwritten Bangla compound characters that combines a genetic algorithm (GA) with SVM. In this method, SVM performs the first stage of classification, followed by GA, which handles the classes that were misclassified by SVM. Several feature sets, namely shadow, octant centroid, quadtree-based longest run, and various topological attributes, are combined to form the overall feature set for recognition. Bag et al. [23] proposed a method that decomposes compound characters into skeletal segments to improve recognition accuracy. In this method, convex shape primitives are extracted to form the structural feature set, and a template matching scheme is used to recognize the handwritten Bangla compound characters.
We have noticed that very few works on the recognition of complex-shaped Bangla compound characters exist in the literature to date. Researchers are now focusing on structural feature sets to handle the complex structural shape of a compound character. However, extracting proper structural features from a complex-shaped compound character is itself a big challenge. If the structural complexity of the compound characters can be reduced during the preprocessing stages, better recognition accuracy can be expected for such characters. Our work is based on the premise that classifying and recognizing two basic characters instead of one complex-shaped compound character would not only provide better recognition accuracy but also reduce the classification complexity, as the number of classes diminishes. This is significant because classes for compound characters no longer need to be defined separately. To the best of our knowledge, no work has been published on this aspect for Bangla handwritten complex-shaped compound characters. In this paper, we propose a shape decomposition-based strategy to segment compound characters into basic shapes, which provides better classification and recognition accuracy and at the same time reduces classification complexity.
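The premise above can be illustrated with a minimal Python sketch. It is not the proposed method: the decomposition step is assumed to have already produced two basic shapes, and `classes_needed`, `recognize_compound`, `toy_classifier`, and the nested-list image representation are hypothetical names introduced only for illustration. The counts of 50 basic and 43 compound classes are borrowed from the setup of [21] purely as an example.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical image representation: a binary image as rows of 0/1 pixels.
Shape = Sequence[Sequence[int]]


def classes_needed(n_basic: int, n_compound: int, decomposed: bool) -> int:
    """Number of classes the classifier must separate.

    Without decomposition, every compound character is its own class.
    With decomposition, only the basic classes remain.
    """
    return n_basic if decomposed else n_basic + n_compound


def recognize_compound(
    parts: Tuple[Shape, Shape],
    classify_basic: Callable[[Shape], str],
) -> Tuple[str, str]:
    """Label a decomposed compound character as an ordered pair of basic labels."""
    first, second = parts
    return classify_basic(first), classify_basic(second)


def toy_classifier(shape: Shape) -> str:
    """Stand-in for a trained basic-character classifier (assumption).

    It separates two made-up classes by the amount of ink in the image.
    """
    ink = sum(sum(row) for row in shape)
    return "basic_A" if ink <= 2 else "basic_B"


if __name__ == "__main__":
    print(classes_needed(50, 43, decomposed=False))  # 93
    print(classes_needed(50, 43, decomposed=True))   # 50
    parts = ([[1, 0], [0, 1]], [[1, 1], [1, 1]])
    print(recognize_compound(parts, toy_classifier))  # ('basic_A', 'basic_B')
```

In this sketch a compound character never needs its own class label; it is reported as the pair of basic labels returned for its two parts, which is why the class count falls from 93 to 50 in the example.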
The remainder of the paper is organized as follows. We discuss the characteristics of the Bangla language in Section 2, followed by our proposed methodology in Section 3. The experimental results are discussed in Section 4. Finally, a brief conclusion is presented in Section 5.
Characteristics of Bangla language
The Constitution of India recognizes 22 languages, namely Assamese, Bangla, Bodo, Dogri, Gujarati, Hindi, Kannada, Kashmiri, Konkani, Maithili, Malayalam, Manipuri, Marathi, Nepali, Oriya, Panjabi, Sanskrit, Santhali, Sindhi, Tamil, Telugu, and Urdu. These languages are written in 13 different scripts and have over 720 dialects. The Bangla script is used to write the Bangla, Assamese, and Manipuri languages. The importance of the Bangla language can be seen from the fact that it is the seventh most spoken
Proposed methodology
We propose a shape decomposition-based strategy to segment the compound characters into prominent basic shapes for better classification and recognition accuracy. The proposed methodology is divided into four parts: preprocessing and segmentation area detection, group formation and shape decomposition, feature extraction, and classification. We perform sequential steps to identify the segmentation area where the two basic shapes are amalgamated to form a compound character. Once the area is
Dataset
We have used the ICDAR 2013 Segmentation Dataset [31] and the CMATERdb dataset [32] (version 3.1.3.1 for compound characters and version 3.1.2 for basic characters) for our experiments. Both datasets are unbiased, contain varied samples, and are used by most researchers working in this domain. A total of 10,240 sample images of compound characters are used to carry out the experiment. We have used 12,300 basic character images for training. All modules used for
Conclusion
Recognition of cursive Bangla handwritten characters has always been a challenge for researchers. The presence of compound characters makes the task much more difficult. Reducing the structural complexity of these compound characters would make recognition more accurate. In the current work, we have developed a shape decomposition-based methodology to segment complex-shaped compound characters into two basic, simple, and prominent shapes. Our achievement is twofold. We not only achieve
References (32)
- et al., Offline handwritten Arabic cursive text recognition using Hidden Markov Models and re-ranking, Pattern Recogn. Lett. (2011)
- et al., Online and offline handwritten Chinese character recognition: benchmarking on new databases, Pattern Recogn. (2013)
- et al., Bangla handwritten character recognition using convolutional neural network, Int. J. Image, Graph. Signal Process. (IJIGSP) (2015)
- et al., Recognition of Bangla compound characters using structural decomposition, Pattern Recogn. (2014)
- et al., Off-line Roman cursive handwriting recognition, Digital Docum. Process. (2007)
- F. Li, S. Gao, Character recognition system based on back-propagation neural network, in: International Conference on...
- M. Rashad, K. Amin, M. Hadhoud, W. Elkilani, Arabic character recognition using statistical and geometric moment...
- et al., Arabic handwriting recognition using structural and syntactic pattern attributes, Pattern Recogn. (2013)
- et al., Handwritten Chinese text recognition by integrating multiple contexts, IEEE Trans. Pattern Anal. Mach. Intell. (2012)
- Z. Zhong, L. Jin, Z. Xie, High performance offline handwritten Chinese character recognition using GoogLeNet and...
- Handwritten Chinese/Japanese text recognition using semi-Markov conditional random fields, IEEE Trans. Pattern Anal. Mach. Intell.
This paper has been recommended for acceptance by Zicheng Liu.