Elsevier

Neurocomputing

Volume 243, 21 June 2017, Pages 80-87
Urdu Nastaliq recognition using convolutional–recursive deep learning

https://doi.org/10.1016/j.neucom.2017.02.081

Abstract

Recent developments in the recognition of cursive scripts rely on implicit feature extraction methods that provide better results than traditional hand-crafted feature extraction approaches. We present a hybrid approach based on explicit feature extraction, combining convolutional and recursive neural networks for feature learning and classification of the cursive Urdu Nastaliq script. The first layer extracts low-level translation-invariant features using a Convolutional Neural Network (CNN), which are then forwarded to a Multi-Dimensional Long Short-Term Memory network (MDLSTM) for contextual feature extraction and learning. Experiments are carried out on the publicly available Urdu Printed Text-line Image (UPTI) dataset using the proposed hierarchical combination of CNN and MDLSTM. A recognition rate of up to 98.12% for 44 classes is achieved, outperforming the state-of-the-art results on the UPTI dataset.

Introduction

Feature extraction is one of the most significant steps in any machine learning and pattern recognition task. When the patterns under study are images, selecting salient features from raw image pixels not only enhances the performance of the learning algorithm but also reduces the dimensionality of the representation space and hence the computational complexity of the classification task. Depending on the problem under study, a variety of statistical and structural features, computed at global or local levels, have been proposed over the years [1], [2]. Extracting these manual features is expensive in the sense that it requires human expertise and domain knowledge to select the most pertinent and discriminative set of features. These limitations of manual features motivated researchers to extract and select automated and generalized features using machine learning models, especially for problems involving visual patterns such as object detection [3], character recognition [4] and face detection [5].

A number of studies have shown that the convolutional neural network (CNN), a special type of multi-layer neural network, realizes high recognition rates on a variety of classification problems. The CNN is a robust model that can recognize highly variable patterns [6] (such as the varying shapes of handwritten characters) and is not affected by distortions or simple geometric transformations of the patterns. In addition, the model does not require pre-processing to recognize visual patterns or objects, as it performs recognition directly from the raw pixels of images. Moreover, thanks to the CNN's shared-weights property, visual patterns are detected regardless of their position in the image. Under weight sharing, the CNN uses replicated filters that have identical weight vectors and local connectivity. This weight sharing eliminates the redundancy of learning visual patterns at each distinct location, so that each neuron in the model is connected only to a local region of the image. Furthermore, weight sharing and local connectivity reduce over-fitting and computational complexity, giving rise to increased learning efficiency and improved generalization. Due to this weight-sharing property, the CNN architecture is sometimes known as a shift-invariant, shared-weight, or space-invariant artificial neural network. The general architecture of a CNN model is illustrated in Fig. 1. The first part, generally termed the feature extractor of the CNN, learns low-order specific features from the raw image pixels [6]. The last layer is a trainable classifier used for classification. The feature extractor comprises two alternating operations: convolution filtering and sub-sampling.
The illustrated model shows convolution filtering (C) with filters of size 5 × 5 pixels and a down-sampling ratio (S) of 2, represented by C1, S1, C2 and S2 respectively.
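As a rough illustration of the alternating C/S stages described above (a minimal sketch, not the authors' implementation; filter values are random stand-ins for learned weights), one C1/S1 stage with a 5 × 5 filter and a down-sampling ratio of 2 can be written in NumPy as:

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Valid 2-D convolution of a single-channel image with one kernel."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def subsample(fmap, s=2):
    """Down-sample a feature map by taking the max over s x s blocks."""
    h, w = fmap.shape[0] // s * s, fmap.shape[1] // s * s
    return fmap[:h, :w].reshape(h // s, s, w // s, s).max(axis=(1, 3))

rng = np.random.default_rng(0)
image = rng.standard_normal((28, 28))   # MNIST-sized input image
kernel = rng.standard_normal((5, 5))    # one 5x5 filter (learned, in practice)

c1 = conv2d_valid(image, kernel)        # C1: 24 x 24 feature map
s1 = subsample(c1, 2)                   # S1: 12 x 12 after sub-sampling
```

A second C2/S2 stage would repeat the same pair of operations on the pooled maps, which is what produces the progressively smaller, more abstract feature maps of Fig. 1.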

In a number of studies, the CNN model has been used to extract features while another model performs the classification [8], [9], [10]. Applications include emotion recognition [11], digit and character recognition [12], [13], [14], [15] and visual image recognition [12]. Huang and LeCun [6] conclude that the CNN learns optimal features from raw images but is not always optimal for classification. The authors therefore merged the CNN with an SVM, i.e., the features extracted by the CNN are fed to the SVM for classification of generic objects, including animals, human figures, airplanes, cars, and trucks. The hybrid system realized a recognition rate of up to 94.1%, compared with 57% (SVM only) and 92.8% (CNN only).

In [8], Lauer et al. employed a CNN to extract features, without prior knowledge of the data, for recognition of handwritten digits. Combining the features learned by the CNN with an SVM, the authors report a recognition rate of 99.46% (after applying elastic distortions) on the MNIST database. In a similar study, Niu and Suen [9] employed a CNN as a trainable feature extractor from raw images and used an SVM as the recognizer to classify handwritten digits in the MNIST database. This hybrid system realized a recognition rate of 94.40%.

Donahue et al. [10] investigated the combination of CNN and LSTM (Long Short-Term Memory) networks for visual image recognition on the UCF-101 [16], Flickr30k [17] and COCO2014 [18] databases, reporting promising classification results. In another interesting work [19], the authors combine convolutional and recursive neural networks for object recognition: a CNN extracts low-level features from images of an RGB-D dataset, followed by an RNN forest for feature selection and classification. Similarly, Bezerra et al. [20] integrated a multi-dimensional recurrent neural network (MDRNN) with SVM classifiers to improve character recognition rates. In [21], Chen et al. proposed T-RNN (transferred recurrent neural network); the authors extracted visual features using a CNN and detected fetal ultrasound standard planes in ultrasound videos, reporting very promising results. In a later study [22], the authors combined a fully convolutional network (FCN) and a recurrent network for segmentation of 3D medical images; the proposed technique was evaluated on two databases and realized promising results.

Accurate sequence labeling and learning is one of the most important tasks in any recognition system. Sequence labeling requires not only learning long sequences but also distinguishing similar patterns from one another and assigning labels accordingly. Hidden Markov Models (HMM) [23], Conditional Random Fields (CRF) [6], Recurrent Neural Networks (RNN) and variants of the RNN (BLSTM and MDLSTM) [4], [24], [25], [26] have been effectively applied to different sequence-learning problems. A number of studies [27], [28], [29], [30] have concluded that LSTM outperforms HMMs on such problems.

This paper presents a new convolutional–recursive deep learning model that combines CNN and MDLSTM. The proposed model is mainly inspired by the one presented by Raina et al. [31] and is applied to the character recognition problem for Urdu text in the Nastaliq script. The proposed system employs a CNN to automatically extract low-level features from the large MNIST dataset. The learned kernels are then convolved with text-line images to extract features, while the MDLSTM model serves as the classifier. Each (complete) text-line image is fed as a sequence of frames denoted by X = (x_1, x_2, …, x_i), with its corresponding target sequence denoted as T = (t_1, t_2, …, t_j). The input sequence of frames (X) covers all input character symbols of the text-line image, and the target sequence is a sequence of labels over the alphabet of labels (L) in the ground-truth transcription file, i.e., T ∈ L*. The length of the target sequence (T) is less than or equal to that of the input sequence (X), i.e., |T| ≤ |X|.

Let each data sample be a sequence pair (X, T) drawn from the training set (S) independently from the fixed joint distribution D_{X × T}. The training set (S) is used to train the sequence labeling algorithm f: X → T, which then assigns labels to the character sequences of the test set (S′), drawn from the same distribution (S′ ∈ D_{X × T}). The label error rate (Error_lbl) is computed as

Error_lbl = (1 / |T|) · Σ_{(X,T) ∈ S′} ED(h(X), T)

where ED(h(X), T) is the edit distance between the predicted label sequence h(X) and the target sequence (T), and is employed to compute the error rates.
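The error measure above can be made concrete with a short sketch (the sample label sequences and the helper names `edit_distance` and `label_error_rate` are illustrative, not from the paper):

```python
def edit_distance(pred, target):
    """Levenshtein distance between two label sequences."""
    m, n = len(pred), len(target)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                      # deletions only
    for j in range(n + 1):
        d[0][j] = j                      # insertions only
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if pred[i - 1] == target[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[m][n]

def label_error_rate(predictions, targets):
    """Total edit distance normalised by total target length."""
    total_len = sum(len(t) for t in targets)
    total_ed = sum(edit_distance(p, t)
                   for p, t in zip(predictions, targets))
    return total_ed / total_len

# hypothetical predicted vs. ground-truth label sequences
preds = [list("abca"), list("bcd")]
truth = [list("abc"), list("bcd")]
```

Here the first prediction needs one deletion to match its target and the second is exact, so the label error rate is 1 edit over 6 target labels.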

The main contributions of this study include:

  • Demonstration of how convolutional–recursive architectures can be used to effectively recognize cursive text which forbids traditional feature learning due to the large number of classes/recognition units involved.

  • Addressing the challenge of learning feature extraction from a huge number of ligature classes (over 20,000 in Urdu) by proposing a novel transfer learning mechanism in which representative features are learned from only a small set of classes.

  • Showcasing the generalization of the feature extractor by training it on isolated handwritten English digits and then applying it for cursive Urdu machine printed text recognition.

  • Evaluation performed on a benchmark UPTI dataset, thereby facilitating more informative future evaluations.

The rest of this paper is organized as follows. Section 2 details the proposed methodology of combining CNN and MDLSTM for character recognition. Experimental results along with a comparison with the existing systems are presented in Section 3 while Section 4 concludes the paper.


Convolutional–recursive MDLSTM based recognition system

In this section, we present the novel convolutional–recursive deep learning technique proposed in this study. The proposed technique for recognition of Urdu text lines relies on machine-learned features extracted using the CNN. Features are learned using the MNIST digit database [32]. The first convolutional layer of the CNN learns generic features from images of digits. These features are then computed for Urdu text lines and are fed to the MDLSTM for learning higher level transient
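A minimal sketch of this transfer pipeline follows (random arrays stand in for the MNIST-learned kernels and for an Urdu text-line image, and the MDLSTM classifier itself is omitted); it shows how the learned kernels are convolved with a text line and the resulting feature maps are sliced column-wise into the frame sequence consumed by the sequence model:

```python
import numpy as np

def conv_valid(img, k):
    """Valid 2-D convolution (single channel, single kernel)."""
    kh, kw = k.shape
    h, w = img.shape
    return np.array([[np.sum(img[i:i + kh, j:j + kw] * k)
                      for j in range(w - kw + 1)]
                     for i in range(h - kh + 1)])

rng = np.random.default_rng(1)
# Kernels assumed to be already learned from MNIST digit images.
kernels = [rng.standard_normal((5, 5)) for _ in range(4)]
text_line = rng.standard_normal((48, 300))  # text-line image (height x width)

# Convolve each learned kernel with the text line; stack the feature maps.
fmaps = np.stack([conv_valid(text_line, k) for k in kernels])  # (4, 44, 296)

# Build a left-to-right frame sequence for the sequence classifier:
# one feature vector per image column (296 frames of 4 * 44 features each).
frames = fmaps.transpose(2, 0, 1).reshape(fmaps.shape[2], -1)
```

Each row of `frames` plays the role of one x_i in the input sequence X that the recurrent classifier maps to the target label sequence T.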

Results and comparative analysis

Table 4 compares the performance of the proposed technique with the existing systems evaluated on the UPTI database. These include implicit segmentation based approaches [38], [39], [40], [41] and the segmentation free approach using context shape matching technique presented in [33].

Meaningful comparisons of our system are possible with the work of Ul-Hasan et al. [38] and Ahmed et al. [39], where the authors employed BLSTM on raw pixels. Ul-Hasan et al. [38] reported an error rate of

Conclusion

We proposed a convolutional–recursive deep learning model based on a combination of CNN and MDLSTM for recognition of Urdu Nastaliq characters. The CNN is used to extract low-level translation-invariant features and the extracted features are fed to the MDLSTM. The MDLSTM extracts higher-order features and recognizes the given Urdu text-line image. The combination of CNN and MDLSTM proved to be an effective feature extraction method and outperformed the state-of-the-art systems on a public dataset.


References (41)

  • J. Donahue et al.

    Long-term recurrent convolutional networks for visual recognition and description

    Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

    (2015)
  • Q. Mao et al.

    Learning salient features for speech emotion recognition using convolutional neural networks

    IEEE Trans. Multimed.

    (2014)
  • A. Krizhevsky et al.

    ImageNet classification with deep convolutional neural networks

    Advances in Neural Information Processing Systems

    (2012)
  • P. Sermanet et al.

    Convolutional neural networks applied to house numbers digit classification

    Proceedings of the 2012 IEEE International Conference on Pattern Recognition (ICPR)

    (2012)
  • S. Pan et al.

    A discriminative cascade CNN model for offline handwritten digit recognition

    Proceedings of the 2015 IEEE IAPR International Conference on Machine Vision Applications (MVA)

    (2015)
  • D.C. Ciresan et al.

    Convolutional neural network committees for handwritten character classification

    Proceedings of the 2011 IEEE International Conference on Document Analysis and Recognition (ICDAR)

    (2011)
  • K. Soomro et al.

    UCF101: A dataset of 101 human actions classes from videos in the wild

    (2012)
  • P. Young et al.

    From image descriptions to visual denotations: new similarity metrics for semantic inference over event descriptions

    TACL

    (2014)
  • T.-Y. Lin et al.

    Microsoft COCO: common objects in context

    Proceedings of the 2014 European Conference on Computer Vision (ECCV)

    (2014)
  • R. Socher et al.

    Convolutional–recursive deep learning for 3D object classification

    Advances in Neural Information Processing Systems

    (2012)

    Saeeda Naz has been an Assistant Professor and Head of the Computer Science Department at GGPGC No.1, Abbottabad, Higher Education Department of the Government of Khyber-Pakhtunkhwa, Pakistan, since 2008. She received her Ph.D. in Computer Science from Hazara University, Department of Information Technology, Mansehra, Pakistan. She has published two book chapters and more than 30 papers in peer-reviewed national and international conferences and journals. Her areas of interest are Optical Character Recognition, Pattern Recognition, Machine Learning, Medical Imaging and Natural Language Processing.

    Arif Iqbal Umar was born in district Haripur, Pakistan. He obtained his M.Sc. (Computer Science) degree from the University of Peshawar, Peshawar, Pakistan and his Ph.D. (Computer Science) degree from BeiHang University (BUAA), Beijing, PR China. His research interests include Data Mining, Machine Learning, Information Retrieval, Digital Image Processing, Computer Network Security and Sensor Networks. He has 22 years of experience in teaching, research, planning and academic management. Currently he is working as an Assistant Professor (Computer Science) at Hazara University, Mansehra, Pakistan.

    Riaz Ahmad is a Ph.D. student at the Technical University of Kaiserslautern, Germany. He is also a member of the Multimedia Analysis and Data Mining (MADM) research group at the German Research Center for Artificial Intelligence (DFKI), Kaiserslautern, Germany. His Ph.D. study is sponsored by the Higher Education Commission of Pakistan under the Faculty Development Program. Before this, he served as a faculty member at Shaheed Benazir Bhutto University, Sheringal, Pakistan. His areas of research include document image analysis, image processing and Optical Character Recognition. More specifically, his work examines approaches invariant to scale and rotation variations in Pashto cursive text.

    Imran Siddiqi received his Ph.D. in Computer Science from Paris Descartes University, Paris, France in 2009. Presently, he is working as an Associate Professor in the Department of Computer Science at Bahria University, Islamabad, Pakistan. His research interests include image analysis and pattern classification with applications to handwriting recognition, document indexing and retrieval, writer identification and verification, and content-based image and video retrieval.

    Saad Bin Ahmed is serving as a Lecturer at King Saud bin Abdulaziz University for Health Sciences, Saudi Arabia. He completed his Master of Computer Science in Intelligent Systems at the University of Technology, Kaiserslautern, Germany, and served as a research assistant with the Image Understanding and Pattern Recognition (IUPR) research group at the University of Technology, Kaiserslautern, Germany. He has served as a Lecturer at the COMSATS Institute of Information Technology, Abbottabad, Pakistan and at Iqra University, Islamabad, Pakistan. He has also performed duties as a project supervisor at Allama Iqbal Open University (AIOU), Islamabad, Pakistan. His areas of interest are document image analysis, medical image processing and optical character recognition. He has been in the field of image analysis for 10 years and has been involved in pioneering research such as handwritten Urdu character recognition.

    Imran Razzak is working as an Associate Professor of Health Informatics at the College of Public Health and Health Informatics, King Saud bin Abdulaziz University for Health Sciences, National Guard Health Affairs, Riyadh, Saudi Arabia. He is also the associate editor-in-chief of the International Journal of Intelligent Information Processing (IJIIP) and a member of the editorial boards of PLOS One, the International Journal of Biometrics (Inderscience), the International Journal of Computer Vision and Image Processing and the Computer Science Journal, as well as the scientific committees of several conferences. He holds one US/PCT patent and has more than 80 research publications in well-reputed journals and conferences. His areas of expertise include health informatics, image processing and intelligent systems.

    Dr. Faisal Shafait is working as the Director of the TUKL-NUST Research & Development Center and as an Associate Professor in the School of Electrical Engineering & Computer Science at the National University of Sciences and Technology, Pakistan. He has worked for a number of years as an Assistant Research Professor at The University of Western Australia, Australia, a Senior Researcher at the German Research Center for Artificial Intelligence (DFKI), Germany and a visiting researcher at Google, CA, USA. He received his Ph.D. in Computer Engineering with the highest distinction from TU Kaiserslautern, Germany in 2008. His research interests include machine learning and computer vision with a special emphasis on applications in document image analysis and recognition. He has co-authored over 100 publications in international peer-reviewed conferences and journals in this area. He is an Editorial Board member of the International Journal on Document Analysis and Recognition (IJDAR), and a Program Committee member of leading document analysis conferences including ICDAR, DAS, and ICFHR. He is also serving on the Leadership Board of IAPR's Technical Committee on Computational Forensics (TC-6) and as the President of the Pakistani Pattern Recognition Society (PPRS).
