1 Introduction

Classification methods have evolved significantly in recent years, especially with advances in deep learning, and such methods now frequently set state-of-the-art benchmark results, with flat classification being the most commonly used type of classification [6]. With the increasing popularity of deep learning and representation learning, it is expected that, instead of dealing with the N-class problem by learning a single representation for all classes, more specific representations could be learned to better deal with inter-class confusion, so hierarchical classification could be a better alternative for tailoring Convolutional Neural Network (CNN) architectures [6].

Although some hierarchical approaches rely on domain knowledge to be built, the process of creating the hierarchy in an automated way has been investigated. One way to do so is by exploiting the information present in a confusion matrix [3, 5, 6]. That is, after a classifier has been trained, the confusion matrix produced by this classifier on a validation set can be used to find which classes are confused with one another, and a more specialised classification structure can then be generated. Some works have exploited this idea, but a thorough investigation of the impact of the different methods involved in the stages of this process is still lacking. We select three stages that we judge to be relevant: (a) the way the confusion matrix is represented and evaluated, i.e. whether it is the raw matrix or the result of some transformation applied to it [6]; (b) the metric used to compute the (dis-)similarity between classes, which could be the Euclidean distance, the Pearson correlation, a clustering algorithm, and so forth; and (c) the impact of the base classifier itself, i.e. whether a more or less accurate classifier changes the final hierarchical structure and the resulting accuracy.

The aim of this work is to present an investigation of different methods that can be employed to build a hierarchical classification structure, taking into account the three aforementioned stages. By considering three character-recognition datasets from the Extended MNIST (EMNIST) repository, namely Digits, Letters and Balanced, we provide not only a quantitative analysis, by presenting the impact on test-set accuracy, but also a qualitative analysis, given the ease of associating class confusions with the shapes of the characters.

2 Methodology

The main framework for hierarchical classification consists of creating binary verifiers, such as the ones that have proved to be efficient for handwriting recognition problems [4]. In greater detail, during the training phase, given the original N-class problem, we first train an N-class flat classifier C, using a base classifier of type B, which we apply on the validation set to compute a normalised confusion matrix CM, where the columns sum up to one. Then, \(CM'\) is computed by applying a transformation T on CM, and next, the ranking R with the pairs of classes with the highest level of confusion is computed, considering the similarity metric S. Afterwards, using a base classifier of type \(B'\), we train a set of M 2-class verifiers \(V = \{ v_1, \ldots , v_M \}\), corresponding to the M highest positions in R, and associate them with the set of confusing classes \(CC = \{ cc_1, \ldots , cc_M \}\), where \(cc_i = \{ class_j, class_k \}\) and \(j \ne k, j \le N, k \le N\). In the test phase, for each test sample x, the method consists of using a verifier from V whenever the top-two predictions \(pred_1\) and \(pred_2\) are equal to some pair of classes in CC and the likelihood of \(pred_1\) is below the confidence level \(\theta \). If that is the case, the prediction of the selected verifier \(v_i\) is used as the final prediction.
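To make the test-phase routing concrete, the following is a minimal Python sketch of the decision rule described above. The names `flat_model`, `verifiers`, and `theta` are illustrative, not taken from any specific implementation, and `flat_model.predict_proba(x)` is assumed to return the N class likelihoods for a single sample.

```python
# A minimal sketch of the test-phase routing rule described above.
# `flat_model`, `verifiers`, and `theta` are illustrative names.
import numpy as np

def predict_hierarchical(x, flat_model, verifiers, theta):
    """Route x to a 2-class verifier when the flat classifier is unsure.

    `verifiers` maps a frozenset {class_j, class_k} from CC
    to a trained 2-class verifier.
    """
    probs = flat_model.predict_proba(x)
    pred1, pred2 = np.argsort(probs)[::-1][:2]     # top-two predictions
    pair = frozenset((pred1, pred2))
    # Use the verifier only if the pair is in CC and pred_1 is not confident.
    if pair in verifiers and probs[pred1] < theta:
        return verifiers[pair].predict(x)          # verifier's decision wins
    return pred1                                   # otherwise keep flat output
```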

2.1 Transformations on the Confusion Matrix

We take into account three different transformations T that can be applied to CM to generate \(CM'\). The first is the raw matrix itself, i.e. no transformation at all, so that \(CM' = CM\). The other two are described below.

The second approach is based on the transformation used in [6], which we refer to as the distance matrix (DM). In this method, the confusion matrix is first converted to the so-called distance matrix D, where \(D = 1 - CM\) and the elements on the diagonal of D are set to zero. Next, D is converted to a symmetric matrix with \(D = 0.5 * (D + D^T)\), where the entry \(D_{ij}\) is supposed to represent how easy it is to discriminate categories i and j. In this case, \(CM' = D\).
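As a reference, a minimal NumPy sketch of this transformation, assuming `cm` is already the column-normalised confusion matrix, could look as follows:

```python
# A sketch of the distance-matrix (DM) transformation of [6],
# assuming `cm` is the column-normalised confusion matrix as a NumPy array.
import numpy as np

def distance_matrix(cm):
    d = 1.0 - cm                  # D = 1 - CM
    np.fill_diagonal(d, 0.0)      # set the diagonal of D to zero
    return 0.5 * (d + d.T)        # symmetrise: D = 0.5 * (D + D^T)
```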

In addition, we propose a method to which we refer as the penalty matrix (PM), where the idea is to create a matrix that penalises the pairs of classes that present more inter-class errors. In greater detail, given the normalised confusion matrix CM, we initialise the penalty matrix P with zeros. Then, for each entry \(CM_{ij}\) where \(i \ne j\), i.e. an entry in CM that represents a classification mistake, we increment the entries \(P_{ij}\), \(P_{ji}\), \(P_{ii}\), and \(P_{jj}\) by \(CM_{ij}\). We expect the classes i and j with the highest level of confusion to generate greater values in P. Note that incrementing both \(P_{ij}\) and \(P_{ji}\) at the same time results in a symmetric matrix, similar to DM. And by incrementing \(P_{ii}\) and \(P_{jj}\) by the same values, we aim to penalise the classes involved in the most mistakes. Ultimately, in the ideal case we expect \(P_{ij}\), \(P_{ji}\), \(P_{ii}\), and \(P_{jj}\) to have similar values.
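A minimal sketch of the PM construction, under the same NumPy assumptions as above, could be:

```python
# A sketch of the proposed penalty-matrix (PM) transformation: every
# off-diagonal error CM_ij increments four entries of P by CM_ij.
import numpy as np

def penalty_matrix(cm):
    n = cm.shape[0]
    p = np.zeros_like(cm)
    for i in range(n):
        for j in range(n):
            if i != j:
                for r, c in ((i, j), (j, i), (i, i), (j, j)):
                    p[r, c] += cm[i, j]
    return p                      # symmetric by construction
```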

2.2 Similarity Metrics

In order to compute the similarity of two classes, denoted class i and class j, we apply similarity metrics on the columns of the transformed confusion matrix \(CM'\). That is, let \(cm'_i\) be the column of class i and \(cm'_j\) the column of class j; to compute the similarity \(sim_{ij}\) between classes i and j, we make use of the similarity function S as:

$$\begin{aligned} sim_{ij} = S( cm'_i, cm'_j ). \end{aligned}$$
(1)

For implementing S, we consider two different metrics. The first is the well-known Euclidean distance, defined as:

$$\begin{aligned} S( cm'_i, cm'_j ) = \sqrt{ \sum _{k=1}^{N} (cm'_{ik} - cm'_{jk})^2} \end{aligned}$$
(2)

Given that, in the Euclidean space, different pairs of points can present the same distance even if they are in completely different locations, or even at different angles, we also consider the Pearson correlation coefficient, in order to capture a more precise correlation between each pair of cells \(cm'_{ik}\) and \(cm'_{jk}\). In this case, the similarity metric is defined as:

$$\begin{aligned} S( cm'_i, cm'_j ) = \frac{ cov(cm'_i, cm'_j)}{\sigma (cm'_i) \sigma (cm'_j)}, \end{aligned}$$
(3)

where \(cov(cm'_i, cm'_j)\) is the covariance between populations \(cm'_i\) and \(cm'_j\), \(\sigma (cm'_i)\) is the standard deviation of \(cm'_i\), and \(\sigma (cm'_j)\) is the standard deviation of \(cm'_j\).
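As a reference, both metrics can be sketched with NumPy as below, where `col_i` and `col_j` stand for the columns \(cm'_i\) and \(cm'_j\); the use of population (biased) covariance and standard deviation is our assumption.

```python
# Sketches of the two similarity functions S (Eqs. 2 and 3).
import numpy as np

def euclidean(col_i, col_j):
    # Eq. (2): Euclidean distance between two columns of CM'.
    return np.sqrt(np.sum((col_i - col_j) ** 2))

def pearson(col_i, col_j):
    # Eq. (3): covariance normalised by the two standard deviations.
    cov = np.cov(col_i, col_j, bias=True)[0, 1]
    return cov / (np.std(col_i) * np.std(col_j))
```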

2.3 Base Classifiers

For the base classifiers B and \(B'\) (in this work \(B' = B\)), we consider three different types of classifiers from the family of neural networks, ranging from less to more complex approaches, which we expect to vary significantly in terms of flat classification accuracy. The three methods we evaluate are Logistic Regression (LR), Multi-Layer Perceptron (MLP) Neural Networks, and Convolutional Neural Networks (CNN).

Both the LR and the MLP classifiers are standard models, trained with the stochastic gradient descent algorithm, where for the MLP we set one hidden layer with 1,000 neurons. For the CNN, we consider a network comprising the following architecture: a layer with 32 convolution filters of size 3 \(\times \) 3; another layer with 64 convolution filters of size 3 \(\times \) 3; a max pooling layer of size 2 \(\times \) 2; a dropout layer with rate set to 0.25; a dense layer with 128 neurons; another dropout layer with rate set to 0.5; and the final activation layer with the softmax function. Note that for all convolution and dense layers, the rectified linear unit activation function is used.
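As an illustration, a Keras sketch of this architecture could look as follows; the input shape, the optimiser choice, and the Flatten layer between pooling and the dense layer are our assumptions, since the text specifies only the listed layers.

```python
# A sketch of the CNN described above. The Flatten layer and the
# 28x28x1 input shape are assumptions, not stated in the text.
from tensorflow.keras import layers, models

def build_cnn(num_classes, input_shape=(28, 28, 1)):
    return models.Sequential([
        layers.Conv2D(32, (3, 3), activation="relu", input_shape=input_shape),
        layers.Conv2D(64, (3, 3), activation="relu"),
        layers.MaxPooling2D(pool_size=(2, 2)),
        layers.Dropout(0.25),
        layers.Flatten(),                 # assumed bridge to the dense layer
        layers.Dense(128, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(num_classes, activation="softmax"),
    ])
```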

3 Experiments

For the experimental evaluation of the proposed methodology, we consider datasets from the Extended MNIST (EMNIST) repository [2]: Digits, with 10 classes, 240,000 samples for training and 40,000 samples for test; Letters, with 26 classes, 124,800 samples for training and 20,800 samples for test; and Balanced, with 47 classes, 112,800 samples for training and 18,800 samples for test.

Given this protocol, the first step of our evaluations consisted of evaluating the different combinations of transformation methods for T and similarity metrics for S. In this case, we consider four different combinations:

  • ED, with \(CM' = CM\) and the Euclidean distance for S;

  • PC, also with \(CM' = CM\) but with Pearson correlation for S;

  • DM, with DM for T and Pearson correlation for S;

  • PM, with the PM method for T and Pearson correlation for S.

For these experiments, we fixed the base classifier B to MLP, an approach that is simpler and faster than a CNN but that we expected to be more accurate than LR. In addition, M was set to 10 for all experiments. The comparison of the different methods, in terms of recognition accuracy on the test set, is presented in Table 1.
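For reference, the ranking step shared by all four combinations can be sketched as follows; whether the most-confused pairs have the smallest or the largest score depends on S (ascending for the Euclidean distance, descending for the Pearson correlation), and the function names are ours.

```python
# A sketch of building the top-M ranking R of confusing class pairs
# from the transformed matrix CM' (M = 10 in our experiments).
from itertools import combinations

def top_confusing_pairs(cm_t, similarity, m=10, descending=True):
    n = cm_t.shape[0]
    scored = [((i, j), similarity(cm_t[:, i], cm_t[:, j]))
              for i, j in combinations(range(n), 2)]
    # Most-confused first: descending for Pearson, ascending for Euclidean.
    scored.sort(key=lambda t: t[1], reverse=descending)
    return [pair for pair, _ in scored[:m]]
```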

On Digits, the proposed hierarchical methods do not seem to improve the recognition rates. With the exception of the ED method, the application of the other methods resulted in lower accuracies. It is curious that PC, DM and PM performed identically in terms of accuracy, yet the number of samples selected for verification by DM was much smaller than those of PC and PM. This indicates that DM might not be the best option to detect confusion between classes. On the Letters and Balanced datasets, on the other hand, improvements in accuracy were shown by all methods. In this case, though, ED performed slightly worse than PC and PM, but still better than DM. Overall, it seems that PM might be a better method for this problem, since it reached the highest accuracies on both Letters and Balanced.

To complement this analysis, we also present in Table 1 the result of PM*, which consists of the PM method but with the threshold \(\theta \) optimised on the validation set. With such optimisation, a value below 1.00 was found only on Digits. In that case, the final accuracy was slightly higher than with \(\theta = 1.00\), but still lower than that of the flat classifier.

Table 1. Accuracy of the MLP classifier with the different methods, on all datasets, where: %Flat: accuracy with the flat classifier; %Hier: accuracy with the hierarchical classifier; #Ver(%): total number of verified samples, with the percentage in parentheses; %BV: accuracy on verified samples before verification, i.e. with the flat classifier; %AV: accuracy on verified samples after verification, i.e. with the verifiers.

In Table 2 we present the rankings of confusing classes generated by each method. We can observe that the PC and PM methods tend to generate the most similar rankings, with an intersection of 9 elements on Digits and 7 elements on Letters and Balanced. Qualitatively, we may claim that the rankings of those two methods make a lot of sense, since the most similar classes found by the methods are characters with similar shapes, such as 4 and 9, 3 and 8, I and L, F and f, and so forth. The ED and DM methods, on the other hand, resulted in some odd elements in the rankings, such as 0 and 1, and P and T, which are characters with very different shapes.

Table 2. Ranking of confusing classes with the four methods for transformation and similarity, with multi-layer perceptron neural networks.

Given that PM resulted in the highest accuracies in the previous experiments, we present the evaluation of the LR and CNN classifiers only with that approach. The results of LR, in terms of accuracy, are presented in Table 3, and those of the CNN are presented in Table 4.

With the LR classifier we observe much lower accuracies for the flat classifier, compared with MLP (see Table 1). That said, it was expected that a classifier with a higher error would benefit more from the hierarchical classification, and that was the case for both Digits and Letters. In the former the gain was of about 0.96 percentage points, and in the latter of about 0.68 percentage points. The MLP classifier had a decrease of 0.02 percentage points and a gain of 0.27 percentage points on the same datasets. On the Balanced dataset, even with a higher error, the use of the LR classifier did not result in a higher increase in accuracy, which we suspect is due to LR not being able to generate good verifiers for this problem. In addition, we observe that optimising the parameter \(\theta \) does not bring any gain. In fact, it resulted in lower accuracies on two datasets.

Table 3. Accuracies with the Logistic Regression classifier.

With CNNs we also observe the expected behaviour of reaching the highest accuracies on all problems, and a smaller impact of the verifiers in improving the flat classification. In this case, improvements can only be observed if \(\theta \) is optimised. On Letters, with \(\theta \) optimised, an increase of 0.14 percentage points was observed. On Digits and Balanced, increases of only 0.01 and 0.02 percentage points were achieved. In our opinion, this does not mean that the idea of hierarchical classification is not promising for deep learning, which might be the most robust classifier in some problems, but these approaches might need some tailoring of the CNN architecture to better deal with 2-class problems. Transfer learning from the flat classifier can also be a way to deal with the smaller training set generated for each verifier. Furthermore, since we observe very high accuracy rates for both CNN and MLP on Digits, and the hierarchical scheme can degrade such rates, the results indicate that it might not be worthwhile to use hierarchical classification when flat classification already works very well for the problem.

Table 4. Accuracies with Convolutional Neural Networks.

In Table 5 we present the rankings generated by the three types of base classifiers that we evaluated. By analysing the intersection of the rankings, the results show that the type of base classifier can indeed affect the generation of the ranking of confusing classes. However, the difference looks less pronounced when the complexity of the classification problem is higher, for instance on the Balanced dataset, where the methods presented an intersection of at least 7 elements in their rankings. On Digits, the intersections ranged from 4 to 7 elements only, with MLP and LR being the classifiers with the largest intersection, which is a somewhat surprising result. That is, given that MLP and CNN present a much narrower gap in accuracy than that between LR and MLP, we expected the results of the MLP to be more similar to those of the CNN than to those of the LR. Nonetheless, MLP and CNN present a higher intersection than LR and CNN, indicating that the latter two are the classifiers that behave the least similarly to each other, as expected.

Table 5. Comparison of rankings generated by MLP, LR and CNN.

4 Conclusions and Future Work

In this work we presented an in-depth analysis of different methods for automatically creating a hierarchical classification structure from the confusion matrix of a flat classifier. The results demonstrated that the transformation and the similarity metric can greatly affect the way the list of classes with the highest inter-class confusion is computed. As future work, we intend to conduct a deeper investigation into optimising the classifiers' architectures for both flat and hierarchical classification. In addition, we plan to conduct further evaluations on other datasets, such as the DTD dataset [1], where the inter-class confusion appears to be very challenging.