1 Introduction

In handwriting recognition, the choice of the features used to represent letters and digits is crucial for achieving satisfactory performance. The aim is to address the diversity in style, size, and shape found in handwriting produced by different writers [12]. This has led to the development of a large variety of feature sets, which are becoming increasingly large in terms of number of attributes [11, 13]. Unfortunately, representations involving too many features may cause problems, especially in the presence of noisy or redundant features: the former degrade classification performance, while the latter increase the computational cost of classifying unknown samples. This cost may be problematic in handwriting recognition applications where time constraints are very strict, such as, for example, postal sorting systems. For these reasons, feature selection techniques, which select a subset of relevant features from the available ones, have been widely used to improve the performance of handwriting recognition systems [1,2,3,4,5]. Feature selection problems are characterized by a large search space, since the total number of possible solutions is \(2^n\) for a problem involving n features. As a consequence, exhaustive search for the optimal subset is impractical in most cases. For this reason, many search techniques have been applied to feature selection, such as heuristic or greedy algorithms [14].

Such algorithms typically require both a search strategy for selecting feature subsets and an evaluation function to measure the effectiveness of each selected feature subset. Evaluation functions can be divided into two broad classes: filter and wrapper [14]. Wrapper approaches use the performance of a given classifier as the evaluation function; this leads to high computational costs when a large number of evaluations is required, especially when large datasets are involved. Filter evaluation functions, instead, are independent of any classification algorithm and, in most cases, are faster and more general than wrapper ones.

Less computationally expensive approaches adopt evaluation functions that measure the effectiveness of each single feature in discriminating samples belonging to different classes. Once the available features have been evaluated, the subset search procedure is straightforward: the features are ranked according to their merit and the best M features, where M must be set by the user, are selected. These approaches are typically very fast, but cannot take into account the interactions that may occur between two or more features.
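The ranking-based selection step just described can be sketched in a few lines; the merit scores below are made-up values for illustration only:

```python
import numpy as np

def rank_and_select(features, merits, m):
    """Rank features by a univariate merit score and keep the top m.

    features: list of feature names; merits: one score per feature
    (higher = more discriminative); m: subset size chosen by the user.
    """
    order = np.argsort(merits)[::-1]          # indices sorted best-first
    return [features[i] for i in order[:m]]

# toy example with invented merit scores
names  = ["f1", "f2", "f3", "f4"]
scores = [0.10, 0.80, 0.30, 0.55]
print(rank_and_select(names, scores, 2))      # ['f2', 'f4']
```

Note that the ranking is computed once, so the cost is dominated by the per-feature merit evaluation; this is exactly why such methods are fast but blind to feature interactions.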

In this context, the aim of our work was twofold: on the one hand, we tried to overcome some of the above-mentioned drawbacks by adopting a feature-ranking-based technique for choosing the feature subset able to provide the best classification results. On the other hand, we considered one of the most effective and widely used feature sets in handwriting recognition [11], to verify whether it is possible to improve the classification results for handwriting recognition by using a reduced set of features. We also characterized the features that exhibit higher discriminant power among the three feature groups defined in the above feature set, namely the concavity, the contour information and the character surface. In our experiments, performed on two real-world datasets, we found that, in the large majority of cases, the character surface features were included in the best subsets.

As regards the feature ranking, we considered different univariate measures, each producing a different ranking according to a criterion that evaluates the effectiveness of a single feature in discriminating samples belonging to different classes. In the experiments we also compared the performance of our method with that achieved by other effective and widely used search strategies: the results confirmed the effectiveness of our approach.

The remainder of the paper is organized as follows: in Sect. 2 we will illustrate the considered feature set, in Sect. 3 we will describe the feature evaluation methods, while the experimental results will be illustrated in Sect. 4. Finally, conclusions will be drawn in Sect. 5.

2 The Considered Feature Set

The feature set taken into account measures three properties of a segmented image representing an input sample, related to the concavity, to the contour and to the character surface [11]. The image is divided into 6 zones arranged in three rows and two columns. For each zone, 13 concavity measurements are computed using the 4 Freeman directions as well as 4 auxiliary directions, totaling 78 concavity features (13 measurements \(\times \) 6 zones), normalized between 0 and 1. Then, in each zone, 8 contour features are extracted from a histogram of contour directions, obtained by grouping the contour line segments between neighboring pixels according to the 8 Freeman directions. Therefore, there are 48 contour features for each image, normalized between 0 and 1. Finally, the last part of the feature vector is related to the character surface: the number of black pixels in each zone is counted and normalized between 0 and 1, thus obtaining 6 values for each image. Summarizing, the total number of features is \(78+48+6=132\).
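The character-surface part of the feature vector can be sketched as follows. This is a minimal illustration, not the reference implementation: the even 3×2 zone split and the use of the black-pixel fraction as the normalization are assumptions.

```python
import numpy as np

def surface_features(img):
    """Character-surface features: fraction of black pixels (value 1)
    in each of the 6 zones (3 rows x 2 columns) of a binary image.

    Assumes the zones split the image evenly; the fraction is the
    black-pixel count divided by the zone area, hence already in [0, 1].
    """
    h, w = img.shape
    feats = []
    for r in range(3):
        for c in range(2):
            zone = img[r * h // 3:(r + 1) * h // 3,
                       c * w // 2:(c + 1) * w // 2]
            feats.append(zone.mean())   # black pixels / zone area
    return feats

# a 6x4 image whose top-left zone is entirely black
img = np.zeros((6, 4), dtype=int)
img[:2, :2] = 1
print(surface_features(img))   # first value 1.0, the rest 0.0
```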

3 Feature Evaluation

As anticipated in the introduction, our method requires a univariate measure to rank the features. In this study, we have considered five standard univariate measures, namely Chi-square [10], Relief [9], Gain Ratio, Information Gain and Symmetrical Uncertainty [8]. Each univariate measure ranks the available features according to a criterion that evaluates a feature's effectiveness in discriminating samples belonging to different classes.

The Chi-Square (CS) measure estimates feature merit by applying a discretization algorithm based on the CS statistic. For each feature, the observed values are initially sorted and each one is placed into its own interval. The Chi-square statistic is then used to determine whether the relative frequencies of the classes in adjacent intervals are similar enough to justify merging them.
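The merge test can be illustrated as a contingency-table statistic computed over the per-class counts of two adjacent intervals; this is a minimal sketch of that single step, not the full discretization loop:

```python
def chi_square(interval_a, interval_b):
    """Chi-square statistic for the class counts of two adjacent
    intervals, as used in ChiMerge-style discretization: a low value
    means the class distributions are similar, justifying a merge.

    interval_a, interval_b: per-class sample counts, same class order.
    """
    n_classes = len(interval_a)
    total = sum(interval_a) + sum(interval_b)
    stat = 0.0
    for row in (interval_a, interval_b):
        row_total = sum(row)
        for j in range(n_classes):
            col_total = interval_a[j] + interval_b[j]
            expected = row_total * col_total / total
            if expected > 0:
                stat += (row[j] - expected) ** 2 / expected
    return stat

# identical class distributions -> statistic 0.0, merge is justified
print(chi_square([5, 5], [5, 5]))     # 0.0
# disjoint class distributions -> large statistic, keep intervals apart
print(chi_square([10, 0], [0, 10]))   # 20.0
```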

The second considered measure is Relief (RF), which uses instance-based learning to assign a relevance weight to each feature. The assigned weights reflect each feature's ability to distinguish among the different classes at hand. The algorithm works by randomly sampling instances from the training data. For each sampled instance, the nearest instance of the same class (nearest hit) and of a different class (nearest miss) are found. A feature weight is updated according to how well its values distinguish the sampled instance from its nearest hit and nearest miss. A feature receives a high weight if its values differ between instances of different classes and agree for instances of the same class.
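The weight-update step can be sketched as follows. This is a simplified, deterministic variant that visits every instance instead of random sampling, and assumes numeric features already scaled to [0, 1] with the absolute difference as the per-feature distance:

```python
import numpy as np

def relief_weights(X, y):
    """Minimal Relief sketch (numeric features scaled to [0, 1]).

    For each instance, find its nearest neighbour of the same class
    (hit) and of a different class (miss); a feature's weight grows
    when its value separates the instance from the miss and matches
    the hit.
    """
    n, d = X.shape
    w = np.zeros(d)
    for i in range(n):
        dist = np.abs(X - X[i]).sum(axis=1)   # L1 distance to all others
        dist[i] = np.inf                      # ignore the instance itself
        hit  = np.argmin(np.where(y == y[i], dist, np.inf))
        miss = np.argmin(np.where(y != y[i], dist, np.inf))
        w += (np.abs(X[i] - X[miss]) - np.abs(X[i] - X[hit])) / n
    return w

# feature 2 separates the two classes, feature 1 does not
X = np.array([[0.0, 0.1], [0.1, 0.9], [0.9, 0.2], [1.0, 0.8]])
y = np.array([0, 1, 0, 1])
print(relief_weights(X, y))   # second weight is the larger one
```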

The last three considered univariate measures are based on the well-known information-theoretic concept of entropy H(X), which can be used to estimate the uncertainty of the random variable X. This concept can be extended by defining the conditional entropy H(X|Y), which represents the randomness of X when the value of Y is known. These quantities can be used to define the information gain (IG) concept:

$$ IG=H(C)-H(C|X) $$

IG represents the amount by which the entropy of C decreases when X is given, and thus reflects the additional information about C provided by the feature X.

The last two considered univariate measures use the information gain defined above. The first one, called Gain Ratio (GR), is defined as the ratio between the information gain and the entropy of the feature X to be evaluated:

$$ GR= IG/H(X) $$

Finally, the last univariate measure taken into account, called Symmetrical Uncertainty (IS), compensates for information gain bias toward attributes with more values and normalizes its value to the range [0, 1]:

$$ IS = 2.0\cdot IG/(H(C)+H(X)) $$
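The three entropy-based measures can be computed with a small sketch; `entropy` and `ig_gr_su` are hypothetical helper names, and the feature is assumed to be discrete (in practice, a continuous feature would be discretized first):

```python
import numpy as np

def entropy(labels):
    """Shannon entropy H of a discrete variable, in bits."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def ig_gr_su(x, c):
    """Information Gain, Gain Ratio and Symmetrical Uncertainty of a
    discrete feature x with respect to class labels c, following the
    three formulas above."""
    hc, hx = entropy(c), entropy(x)
    # conditional entropy H(C|X): entropy of c within each value of x,
    # weighted by the frequency of that value
    hcx = sum((x == v).mean() * entropy(c[x == v]) for v in np.unique(x))
    ig = hc - hcx
    gr = ig / hx if hx > 0 else 0.0
    su = 2.0 * ig / (hc + hx) if hc + hx > 0 else 0.0
    return ig, gr, su

# a binary feature that perfectly predicts a binary class
x = np.array([0, 0, 1, 1])
c = np.array(['a', 'a', 'b', 'b'])
print(ig_gr_su(x, c))   # (1.0, 1.0, 1.0)
```

The denominators in GR and IS act as normalizers: GR penalizes features whose own entropy is high, while IS symmetrizes the measure between feature and class.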

To compare our results with those attainable by other searching algorithms defined in the literature, we have considered the well-known Best First (BF) search strategy, combined with two different criteria for feature evaluation, namely the Consistency Criterion [10], and the Correlation-based Feature Selection criterion [8].

The Consistency Criterion (CC) provides an effective measure of how well samples belonging to different classes are separated in a feature sub-space, while the Correlation-based Feature Selection criterion (CFS) uses a correlation-based heuristic to evaluate the worth of a feature subset. This function takes into account the usefulness of individual features for predicting class labels, along with the level of inter-correlation among them. The idea behind this approach is that good subsets contain features highly correlated with the class and uncorrelated with each other. Denoting by X and Y two features, their correlation \(r_{XY}\) is:

$$ r_{XY} = 2.0\cdot (H(X)+H(Y)-H(X,Y))/(H(X)+H(Y)) $$

Given a feature selection problem in which the patterns are represented by means of a set Y of N features, the CFS function computes the merit of the generic subset \(X \subseteq Y\), made of k features, as follows:

$$ f_{CFS} = k \overline{r_{cf}}/\sqrt{k+k(k-1)\overline{r_{ff}}} $$

where \(\overline{r_{cf}}\) is the average feature-class correlation and \(\overline{r_{ff}}\) is the average feature-feature correlation. Note that the numerator estimates the discriminative power of the features in X, whereas the denominator assesses the redundancy among them.
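The trade-off captured by the CFS merit can be seen numerically; the sketch below simply plugs assumed average correlations into the formula above:

```python
import math

def cfs_merit(k, r_cf, r_ff):
    """CFS merit of a k-feature subset, following the formula above.

    r_cf: average feature-class correlation; r_ff: average
    feature-feature correlation (e.g. symmetrical uncertainties).
    """
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)

# higher inter-feature correlation (redundancy) lowers the merit
# even though the feature-class correlation is unchanged
print(cfs_merit(5, 0.5, 0.1))   # ~0.94
print(cfs_merit(5, 0.5, 0.9))   # ~0.52
```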

4 Experimental Results

In order to assess the effectiveness of the proposed approach in handwritten character recognition problems, we considered two real-world databases, namely the well-known NIST-SD19 and the RIMES databases.

NIST [7] contains binary images representing alphanumeric characters. We have considered handwritten uppercase and lowercase letters (52 classes). The handwriting sample form hsf4, containing 23941 characters (11941 uppercase and 12000 lowercase), was merged with the form hsf7, containing 23670 characters (12092 uppercase and 11578 lowercase), and used as a single database of 47611 samples. In each form, characters are isolated, labeled and stored in \(128 \times 128\) pixel images.

RIMES is a publicly available database containing real-world handwritten words and has been largely adopted for performance evaluation of handwriting recognition systems [6]. The 4047 word images of RIMES were processed in order to extract sub-images containing connected components of ink, which were labeled by six human experts. At the end of this process, from the 9869 labeled samples, a subset of 4768 samples corresponding to isolated characters was extracted and used for our experiments.

Fig. 1. Recognition rates on NIST as a function of the number of selected features.

We performed two sets of experiments and evaluated the effectiveness of the selected feature subsets by using two well-known and widely used classification schemes, namely K-Nearest Neighbor (K-NN) and Bagging.

In the first set of experiments, we applied the feature evaluation methods illustrated in Sect. 3. To illustrate our experimental setup, let us first consider the NIST database. We applied the univariate measures illustrated in Sect. 3 to these data, obtaining 5 different feature rankings. Consider the ranking provided by the first univariate measure, namely CS: by using this feature ranking, we generated different representations of the NIST database, each containing an increasing number of features. More specifically, we generated 15 datasets in the following way: in the first one, NIST samples were represented by using the first 5 features in the ranking, in the second one by using the first 10 features, in the third one the first 15 features, and so on, adding each time the next 10 features in the ranking. In the last dataset, NIST samples were represented by using all 132 available features. The same procedure was repeated for the other univariate measures applied to the NIST database, and likewise for RIMES. Summarizing, for each database we obtained 5 different feature rankings, each used to generate 15 different sets of data with an increasing number of features. Each of them was used in the experiments to evaluate the obtainable classification results.
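The nested-dataset construction described above can be sketched as follows; the size schedule is passed in as a parameter rather than hard-coded, since only the general scheme (prefixes of a ranking, ending with the full set) matters here:

```python
def nested_subsets(ranking, sizes):
    """Build nested feature subsets from a ranking: subset i contains
    the first sizes[i] features, so each dataset representation extends
    the previous one. A sketch of the dataset-generation step."""
    return [ranking[:m] for m in sizes]

# toy 8-feature ranking and an illustrative size schedule
ranking = [f"f{i}" for i in range(1, 9)]
for subset in nested_subsets(ranking, [2, 4, 8]):
    print(subset)
```

Each returned subset defines one dataset representation; a classifier is then trained and evaluated on every representation to locate the best-performing prefix length.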

As regards the classification process, we considered the two classification schemes mentioned above, using a 10-fold cross-validation strategy and performing 20 runs for each experiment. The results reported in the following were obtained by averaging the values over the 20 runs.

Figure 1 shows the average results on NIST. In each plot, the x-axis reports the number of features used to represent the input samples, while the y-axis reports the corresponding classification results, in terms of recognition rate. Similarly, Fig. 2 shows the results on RIMES.

Fig. 2. Recognition rates on RIMES as a function of the number of selected features.

Table 1. Best Recognition Rates (RR) and the related number of features (NF).

It is interesting to note that, accepting a reduction of the recognition rate of about 5% with respect to its maximum value, it is possible to select a very small subset of features, namely the first 30 in the rankings, strongly reducing the computational complexity of the classification problem. The plots in the figures also show that, using the first 60 features in the rankings (i.e. less than 50% of all the available ones), the reduction of the recognition rate is less than 2%.

For the sake of clarity, we have also summarized the obtained classification results in Table 1. The first row of the table shows the recognition rate (RR) obtained with all the available features, while the following rows show the best RR for each feature ranking, together with the corresponding number of selected features (NF). The last two rows of the table show the RR obtained with the CFS and CC feature selection methods and the corresponding NF. The data in the table show that, in the large majority of cases, the recognition rates obtained with our feature selection method outperform those obtained by using the other considered feature subset selection methods, as well as those obtained by using all the available features. Moreover, the best results are always achieved using a number of features significantly smaller than 132. On average, the number of features allowing us to obtain the best results is about 90, i.e. about 70% of the total number of features. Finally, as regards the performance of the subset feature selection methods, CFS provides slightly worse recognition rates, but uses on average a smaller number of features. The CC feature selection method, instead, performs significantly worse than all the other methods, selecting on average too few features.

In the second set of experiments, we analyzed the discriminant power of the groups of features described in Sect. 2 (concavity, contour and character surface). To this aim, we computed the percentages of features of each group that: (i) were included in the best feature subsets obtained by using the five univariate measures taken into account; (ii) were selected by the two feature subset evaluation methods. The results, shown in Fig. 3, indicate that the features representing contour information and those representing character surface information have very high discriminant power and are almost always selected. On the contrary, the features associated with concavity information, whose number is higher than that of the other categories, seem to be less distinctive: in most cases, more than 50% of such features were discarded.

Fig. 3. Percentage of each feature group (concavity, contour and surface) included in the best feature subsets.

5 Conclusions

We have proposed a feature selection method that uses a feature-ranking-based technique for choosing feature subsets able to provide high classification results. We used a feature set widely adopted in handwriting recognition for representing samples of two real-world databases. The experimental results confirmed that it is possible to choose a reduced set of features without affecting the overall classification rates. The results also showed that it is possible to strongly reduce the number of features while accepting a very limited reduction of the best recognition rate (less than 2%). This opportunity may be very helpful in applications where keeping the computational cost of the classification system low is crucial. We have also characterized the type of features that exhibit higher discriminant power among the whole set of available ones.