
1 Introduction and Related Works

Information processing techniques are currently advancing rapidly, with growing potential in the domain of human-computer interaction. In recent years, the machine simulation of human reading has been studied intensively. Handwriting recognition is part of the larger domain of pattern recognition and aims to develop systems that come as close as possible to the human ability to read.

Recognition of Arabic handwriting is lagging behind, mainly because of the complexity and cursive nature of the script. Consequently, automatic recognition of handwritten Arabic script remains a demanding task. Since the late 1960s, owing to its broad applicability in several engineering and technological areas, Arabic handwritten script (AHS) recognition has been the subject of in-depth studies [1]. Many studies have addressed the recognition of Arabic handwritten characters using unsupervised feature learning and hand-designed features [2, 3].

Extracting suitable features from an image is a difficult and complex task. It requires a skilled and experienced specialist in feature extraction methods such as MFCC features in speech processing and Gabor and HOG features in computer vision. The choice and quality of these hand-designed features determine the effectiveness of the models used for classification and recognition, such as the Multi-layer Perceptron (MLP), Hidden Markov Model (HMM) and Support Vector Machine (SVM). However, most classifiers face a major problem, namely the variability of the feature vector size. Therefore, many researchers have turned to training handwriting recognition systems on raw or unlabeled data, as this is the easiest way to handle large amounts of data.

The ability to automatically extract features and to model high-level abstractions in various signals, such as images and text, has made deep learning (DL) algorithms widespread in Artificial Intelligence research. Our first objective is therefore to implement a system for automatic feature extraction that yields richer features than those obtained by heuristic signal processing based on domain knowledge. This approach relies on deep learning of a representation of Arabic script from the image signal. To this end, both unsupervised and supervised learning methods have shown potential. Such learned representations can be applied to various handwriting recognition tasks.

Recent research has shown that DL methods have made it possible to make decisive progress in solving tasks such as object recognition [4, 5], computer vision [6], speech recognition [7, 8] and Arabic handwriting recognition [9].

Developed by LeCun et al. [10], the Convolutional Neural Network (CNN) is a specialized type of Neural Network (NN) that automatically learns useful features from the given dataset at every layer of the architecture, where a layer can be a convolution layer, a pooling layer or a fully connected layer. Ranzato et al. [11] later improved performance by using unsupervised pre-training on a CNN.

Another extensively used model is the Deep Belief Network (DBN) [12]. The DBN is one of the most classical deep learning models and is composed of several Restricted Boltzmann Machines (RBMs) in cascade. It learns high-level feature representations from unlabeled data using unsupervised learning algorithms.

Compared with shallow learning, the advantage of DL is that deep structures can be designed to learn internal representations and more abstract details of the input data. However, the large number of parameters can lead to another problem: over-fitting. Improving existing regularization techniques or developing new effective ones is therefore necessary. In recent years, various regularization techniques have been proposed, such as batch normalization, Dropout and Dropconnect.

The contribution of this paper is to leverage the DL approach to solve the problem of recognizing handwritten Arabic text. To this end, we study the potential benefits of our previously suggested hybrid CDBN/SVM structure [13], in which the Convolutional Deep Belief Network (CDBN) acts as an automatic feature extractor and the SVM as the output predictor. Furthermore, to enhance the efficiency of the CDBN/SVM model, regularization methods such as Dropout and Dropconnect are used to guard against over-fitting.

This paper is organized as follows: Sect. 2 gives an overview of the basic components of the Convolutional Deep Belief Network model and of the regularization techniques, and then presents and discusses our target architectures for recognizing Arabic handwritten text. Section 3 describes the experimental study, and Sect. 4 discusses the results. The last section concludes this work with some remarks.

2 Deep Models for Handwritten Recognition

In this section, the DBN model based on the RBM is presented first, and the CDBN model is then reviewed. Finally, the effect of the Dropout and Dropconnect techniques on our CDBN architectures is analyzed.

2.1 Restricted Boltzmann Machine (RBM)

The DBN is a hierarchical generative model [12] built from several RBM layers [14, 15]; it consists of one layer of observed units and multiple layers of hidden units. The connections between the two top layers of the DBN are undirected, the other connections are directed, and there are no connections between units of the same layer. To initialize the weights of the network, DBNs use a greedy layer-by-layer pre-training algorithm.

An RBM is an undirected graphical model consisting of two layers, in which the visible units \( v \) are connected to the hidden units \( h \). The energy function and the joint probability distribution are given by:

$$ E(v,h) = - \sum_{i,j} v_{i} w_{ij} h_{j} - \sum_{j} b_{j} h_{j} - \sum_{i} c_{i} v_{i} $$
(1)
$$ P(v,h) = \frac{1}{Z} e^{-E(v,h)} $$
(2)

Here, \( w_{ij} \) is the weight between visible unit \( i \) and hidden unit \( j \), \( b_{j} \) is the bias of hidden unit \( j \), \( c_{i} \) is the bias of visible unit \( i \), and \( Z \) is the partition function.
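To make Eqs. (1) and (2) concrete, the following minimal sketch evaluates the energy and the joint probability of a toy binary RBM. It is an illustration only: the function names, the toy weights and the brute-force computation of Z are assumptions introduced here, and enumerating every configuration is feasible only for very small models.

```python
import itertools
import math


def rbm_energy(v, h, w, b, c):
    """Energy E(v, h) of Eq. (1): w[i][j] couples visible unit i and
    hidden unit j, b[j] is a hidden bias and c[i] is a visible bias."""
    interaction = sum(v[i] * w[i][j] * h[j]
                      for i in range(len(v)) for j in range(len(h)))
    hidden_bias = sum(b[j] * h[j] for j in range(len(h)))
    visible_bias = sum(c[i] * v[i] for i in range(len(v)))
    return -interaction - hidden_bias - visible_bias


def partition_function(w, b, c):
    """Z: sum of exp(-E) over every binary (v, h) configuration."""
    n_v, n_h = len(c), len(b)
    return sum(math.exp(-rbm_energy(v, h, w, b, c))
               for v in itertools.product((0, 1), repeat=n_v)
               for h in itertools.product((0, 1), repeat=n_h))


def joint_probability(v, h, w, b, c):
    """P(v, h) of Eq. (2)."""
    return math.exp(-rbm_energy(v, h, w, b, c)) / partition_function(w, b, c)


# Hypothetical toy model: 2 visible and 2 hidden binary units.
w = [[0.5, -0.2], [0.1, 0.3]]
b = [0.0, 0.1]
c = [-0.1, 0.2]

# The joint probabilities of all configurations should sum to one.
total = sum(joint_probability(v, h, w, b, c)
            for v in itertools.product((0, 1), repeat=2)
            for h in itertools.product((0, 1), repeat=2))
print(round(total, 6))
```

In practice Z is intractable for realistically sized RBMs, which is why training relies on approximations rather than on this kind of enumeration.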

2.2 Convolutional Restricted Boltzmann Machine (CRBM)

Constructing hierarchical feature structures is a challenge, and the Convolutional Deep Belief Network (CDBN) has been one of the best-known feature extractors in pattern recognition over the last decade. In this subsection, we describe the basic notions of this approach.

As a hierarchical generative model [16], the Convolutional Deep Belief Network supports efficient bottom-up and top-down probabilistic inference. Like the standard Deep Belief Network, this model is made up of several probabilistic max-pooling CRBMs stacked on top of each other, and it is trained with the greedy layer-by-layer algorithm [12, 17]. Probabilistic max-pooling shrinks the representation of the detection layers in a probabilistic manner. Shrinking the representation with max-pooling makes the higher-layer representations invariant to local translations of the input data, reduces the computational load [18] and is useful for visual recognition problems [19].
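The pooling step can be sketched as follows. This is a hedged illustration based on our reading of the probabilistic max-pooling formulation in [16], in which at most one detection unit per pooling block is active and the "all off" state competes with the units in a softmax. The function name and the example inputs are hypothetical.

```python
import math


def probabilistic_max_pool(block_inputs):
    """For the bottom-up inputs of the detection units in one pooling
    block, return (p_on, p_off): the probability that each detection unit
    is the single active unit, and the probability that the whole block
    (and hence its pooling unit) is off."""
    shift = max(0.0, max(block_inputs))      # shift for numerical stability
    exps = [math.exp(x - shift) for x in block_inputs]
    off = math.exp(-shift)                   # the "all off" state has input 0
    z = off + sum(exps)
    return [e / z for e in exps], off / z


# A 2 x 2 pooling block (pooling ratio 2) flattened into four inputs.
p_on, p_off = probabilistic_max_pool([2.0, -1.0, 0.5, 0.0])
print([round(p, 3) for p in p_on], round(p_off, 3))
```

Because the block has one extra "off" state, the probabilities of the detection units and of the off state together sum to one, and the pooling unit is on with probability `1 - p_off`.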

Once a Convolutional Deep Belief Network is built, the algorithm learns high-level features using end-to-end training. In our experiments, we trained a CDBN architecture with two CRBM layers to automatically learn hierarchical features in an unsupervised/supervised manner. Figure 1 shows the architecture of a CRBM, which is made up of two layers: a visible layer V and a hidden layer H, connected by sets of local and shared parameters. A detailed technical report is available at [20].

For real-valued visible inputs, the energy function of the probabilistic max-pooling CRBM is defined as:

$$ \begin{aligned} E(v,h) = {} & \frac{1}{2} \sum_{i,j=1}^{N_{V}} v_{i,j}^{2} - \sum_{k=1}^{K} \sum_{i,j=1}^{N_{H}} \sum_{r,s=1}^{N_{W}} h_{i,j}^{k} w_{r,s}^{k} v_{i+r-1,j+s-1} \\ & - \sum_{k=1}^{K} b_{k} \sum_{i,j=1}^{N_{H}} h_{i,j}^{k} - c \sum_{i,j=1}^{N_{V}} v_{i,j} \end{aligned} $$
(3)
Fig. 1. Representation of a probabilistic max-pooling CRBM. \( N_{V} \) and \( N_{H} \) refer to the dimensions of the visible and hidden layers, and \( N_{W} \) to the dimension of the convolution filter.
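The energy in Eq. (3) can be evaluated term by term, as in the sketch below. It assumes a square visible layer and a "valid" convolution, so that \( N_{H} = N_{V} - N_{W} + 1 \), which is what the index \( i + r - 1 \) in Eq. (3) implies; indices are zero-based in the code. The function name and the toy values are hypothetical.

```python
def crbm_energy(v, h, w, b, c):
    """Energy of Eq. (3).

    v: N_V x N_V real-valued visible layer (list of lists)
    h: K x N_H x N_H binary hidden groups
    w: K x N_W x N_W convolution filters
    b: K hidden biases (one per group); c: single visible bias
    """
    n_v = len(v)
    n_w = len(w[0])
    n_h = n_v - n_w + 1                      # valid convolution (assumption)
    quadratic = 0.5 * sum(v[i][j] ** 2
                          for i in range(n_v) for j in range(n_v))
    interaction = 0.0
    hidden_bias = 0.0
    for k in range(len(w)):
        for i in range(n_h):
            for j in range(n_h):
                if h[k][i][j] == 0:
                    continue                 # inactive units contribute nothing
                patch = sum(w[k][r][s] * v[i + r][j + s]
                            for r in range(n_w) for s in range(n_w))
                interaction += h[k][i][j] * patch
                hidden_bias += b[k] * h[k][i][j]
    visible_bias = c * sum(v[i][j] for i in range(n_v) for j in range(n_v))
    return quadratic - interaction - hidden_bias - visible_bias


# Toy example: 3 x 3 input, one 2 x 2 filter, one active hidden unit.
v = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
w = [[[1, 0], [0, 1]]]
h = [[[1, 0], [0, 0]]]
energy = crbm_energy(v, h, w, b=[0.5], c=0.1)
print(round(energy, 6))
```

The filter weights are shared across all positions of a group, which is why the parameter count depends on K and \( N_{W} \) but not on the image size.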

2.3 Regularization Methods

The use of deep network models for cursive handwriting recognition has made significant progress over the past decade. Nevertheless, for these architectures to be used effectively, a large amount of data needs to be collected.

Moreover, over-fitting is a serious problem in such networks because the number of parameters grows as the network becomes larger and deeper. To overcome this problem, many regularization and data augmentation procedures have been proposed [21, 22, 23].

In this subsection, we briefly introduce two regularization techniques that may affect training performance. Dropout and Dropconnect are both methods for preventing over-fitting in a neural network.

In Dropout, a subset of units is selected at random and their outputs are set to zero regardless of the input. This effectively removes these units from the model. A different subset of units is selected at random each time a training example is presented.

Dropconnect operates in the same way, except that individual weights are deactivated (i.e., set to zero) rather than nodes, so a node may remain partly active. Dropconnect can thus be seen as a generalization of Dropout: it generates even more possible models, since there are almost always more connections than units.
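The difference between the two techniques can be sketched on a single fully connected layer. The code below is a minimal illustration, not the implementation used in our experiments: the function names and toy values are hypothetical, and the rescaling commonly applied to compensate for dropped units or weights is omitted for brevity.

```python
import random


def dropout_forward(x, w, p_drop, rng):
    """Dropout: zero each input unit with probability p_drop, then apply w.
    Returns the layer output and the unit mask that was sampled."""
    mask = [0.0 if rng.random() < p_drop else 1.0 for _ in x]
    kept = [xi * mi for xi, mi in zip(x, mask)]
    out = [sum(kept[i] * w[i][j] for i in range(len(x)))
           for j in range(len(w[0]))]
    return out, mask


def dropconnect_forward(x, w, p_drop, rng):
    """Dropconnect: zero each individual weight with probability p_drop.
    Returns the layer output and the weight mask that was sampled."""
    mask = [[0.0 if rng.random() < p_drop else 1.0 for _ in row] for row in w]
    out = [sum(x[i] * w[i][j] * mask[i][j] for i in range(len(x)))
           for j in range(len(w[0]))]
    return out, mask


rng = random.Random(0)                     # fixed seed for reproducibility
x = [1.0, 2.0, 3.0]                        # 3 units feeding 2 units
w = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out_do, unit_mask = dropout_forward(x, w, 0.5, rng)
out_dc, weight_mask = dropconnect_forward(x, w, 0.5, rng)
print(len(unit_mask), sum(len(row) for row in weight_mask))
```

In this toy layer the Dropout mask has 3 entries while the Dropconnect mask has 6, one per connection, which illustrates why Dropconnect can sample from a larger family of thinned models.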

2.4 Model Settings

To extend our previous study [13] and explore the power of the deep convolutional classifier on the AHS recognition problem, we present in this work a detailed study of the CDBN with the Dropout and Dropconnect techniques. In this subsection, we specify the tuning parameters of the chosen convolutional DBN structure.

As noted above, our CDBN architecture is composed of two CRBM layers (see Fig. 2). The performance of this architecture was evaluated on the IFN/ENIT handwritten text recognition task.

The CDBN architecture used in the experiments conducted on the IFN/ENIT database is described as follows: \( 1 \times 300 \times 100 - 12W24G - MP2 - 10W40G - MP2 \). This corresponds to a network with input images of dimension \( 300 \times 100 \), a first layer consisting of 24 groups of \( 12 \times 12 \) pixel filters, and a second layer consisting of 40 groups of \( 10 \times 10 \) filters; the pooling ratio C of each layer is 2. We set the sparsity parameter to 0.03. The first-layer bases learn the strokes that make up the characters, while the second-layer bases learn character parts formed by groups of strokes. We construct feature vectors by combining the activations of the first and second layers, and Support Vector Machines are used to classify these features.
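As a rough check of the sizes implied by this notation, the sketch below propagates the input dimensions through the two layers. It rests on assumptions that are not stated above: "valid" convolutions, non-overlapping pooling with floor rounding, and feature vectors formed by concatenating the pooled maps of both layers. The helper name is hypothetical.

```python
def cdbn_layer_shapes(input_hw, layers):
    """Return the (groups, height, width) of the pooled maps of each layer.

    layers: list of (filter_width, n_groups, pooling_ratio) tuples.
    """
    height, width = input_hw
    shapes = []
    for filter_width, n_groups, pool in layers:
        # Valid convolution shrinks each side by filter_width - 1.
        height, width = height - filter_width + 1, width - filter_width + 1
        # Non-overlapping pooling; incomplete border blocks are discarded.
        height, width = height // pool, width // pool
        shapes.append((n_groups, height, width))
    return shapes


# 1 x 300 x 100 - 12W24G - MP2 - 10W40G - MP2
shapes = cdbn_layer_shapes((300, 100), [(12, 24, 2), (10, 40, 2)])
feature_length = sum(g * hh * ww for g, hh, ww in shapes)
print(shapes, feature_length)
```

Under these assumptions the pooled maps are 24 maps of 144 × 44 and 40 maps of 67 × 17, so the concatenated feature vector would have 197624 entries; a different padding or rounding convention would change these figures.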

To regularize these architectures and use them as effectively as possible, units or weights are randomly dropped. Dropout was applied at the input layer with a probability of 20% and at each hidden layer with a probability of 50%, while Dropconnect was applied only at the input layer with a probability of 20%.

Fig. 2. Representation of the suggested CDBN structure with dropout.

3 Experiments with Proposed Model

This section presents an experiment that evaluates the performance of the suggested approach on the IFN/ENIT benchmark database [24]. In our experiments, each IFN/ENIT image was normalized to the same input dimension of 300 × 100 pixels for the visible layer. These text images are gray-level, and the resized images are not necessarily square.

Generally, a handwritten script recognition system consists of three principal steps: pre-processing, automatic feature extraction and classification. These steps, together with the parameter setting of our model, are as follows.

  • Pre-processing: This phase generates a normalized and uniform text image.

  • Feature extraction: This phase determines the feature vectors.

  • Training: This phase finds the models that best fit the inputs of the problem.

  • Parameter setting: For configuration, the number and size of the filters, the sparsity of the hidden units and the max-pooling region size must be specified for each layer of the Convolutional DBN model. Given the size of the images used (high-dimensional data), we specify a hyper-parameter setting for the configuration of the Convolutional DBN structure. To make the most of this architecture, two regularization methods, Dropout and DropConnect, were applied separately to the Convolutional DBN structure.

3.1 Dataset Description and Experimental Setting

To measure the effectiveness of the proposed system on high-dimensional input images, the IFN/ENIT database [24] is used. The IFN/ENIT database comprises 26459 handwritten Arabic words contributed by 411 volunteers, for a total of around 115420 parts of Arabic words (PAWs) and around 212167 letters. The words are the names of 946 Tunisian towns and villages, each with its postal code. The data consist of offline handwritten Arabic words. Sets ‘a’ and ‘b’ are used for the training phase, whereas the test set is taken from set ‘c’. Figure 3 shows samples of a village name written by 5 different writers.

Fig. 3. Samples from the IFN/ENIT data set.

3.2 Experimental Results and Comparison

Table 1 compares the results of our approach with previously published results. After applying Dropout, our CDBN structure yielded encouraging results, with a Word Error Rate (WER) of around 9.76%, when compared with the work of Maalej and Kherallah [25], which uses a Recurrent Neural Network (RNN). With Dropconnect, we obtained an error rate of 14.09%.

In addition, the rate achieved is compared with our earlier work. The result reported in [13] is a WER of 16.3% using the Convolutional DBN structure without Dropout, which does not compare well with the classic approaches [26, 27]. This is because the Convolutional DBN architecture can be over-complete. Experimentally, a model that is over-complete or over-fitted may be prone to learning trivial solutions, such as pixel detectors. To address this issue in the present work, we use two regularization techniques for the Convolutional DBN, namely Dropout and Dropconnect. As a result, the WER improves by approximately 6.54 percentage points with Dropout and 2.21 percentage points with Dropconnect.
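The improvements quoted above are absolute differences between error rates. The short sketch below reproduces them from the reported WER values and also derives the corresponding relative reductions, which are computed here for illustration only; the function names are hypothetical.

```python
def absolute_gain(baseline_wer, new_wer):
    """Difference between two error rates, in percentage points."""
    return baseline_wer - new_wer


def relative_reduction(baseline_wer, new_wer):
    """Error reduction expressed as a percentage of the baseline WER."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


baseline = 16.3                              # CDBN without regularization [13]
reported = {"Dropout": 9.76, "Dropconnect": 14.09}
for name, wer in reported.items():
    print(name,
          round(absolute_gain(baseline, wer), 2),
          round(relative_reduction(baseline, wer), 1))
```

Distinguishing the two measures avoids ambiguity: a drop from 16.3% to 9.76% is a gain of 6.54 percentage points, while the relative reduction with respect to the baseline is considerably larger.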

Table 1. Comparison of word recognition performances utilizing the IFN/ENIT database.

In general, the proposed DL architecture, the Convolutional Deep Belief Network with Dropout, provides satisfactory performance, especially compared with other approaches applied to the IFN/ENIT database, such as Dynamic Time Warping (DTW) and the Hidden Markov Model.

4 Discussion

As mentioned above, we propose a DL approach for Arabic handwritten script recognition, namely the Convolutional DBN. To confirm the efficiency of the proposed framework, we presented experimental results on a database of handwritten Arabic words, the IFN/ENIT database.

Our Convolutional DBN architecture with Dropconnect reached a promising error rate of 14.09% on high-dimensional data. We also retrained the proposed Convolutional DBN setting with Dropout, which lowered the WER to 9.76%, a gain of 4.33 percentage points.

The results obtained, regardless of the size of the input data, are significant compared with research using other classification methods, in particular because they were obtained from raw pixels without a feature extraction phase (see Fig. 4). This contribution addresses an interesting challenge in computer vision and pattern recognition, and it provides an incentive for the use of deep machine learning in Big Data analysis.

Fig. 4. WER comparison on the IFN/ENIT database.

5 Conclusion

With the development of DL techniques, deep hierarchical neural networks have drawn great attention for handwriting recognition. In this article, we first introduced a baseline DL approach to Arabic handwritten script recognition, namely the Convolutional Deep Belief Network. Our aim was to leverage the power of these deep networks, which can process high-dimensional input images, so that raw data inputs can be used instead of an extracted feature vector and the complex decision boundary between classes can be learned. Secondly, we investigated the efficiency of two regularization methods applied separately to the Convolutional DBN structure to recognize Arabic words from the IFN/ENIT database. We observed that Dropout is a very efficient regularization technique compared with Dropconnect and with the unregularized baseline.

As future work, we will evaluate the performance of our system in other image processing applications, such as biometric and medical image analysis.