
1 Introduction

Recently, hierarchical classification has received considerable attention in the field of machine learning [19, 30, 38, 39] and has been applied successfully in many applications [3, 9, 40]. In general, hierarchical classification has three advantages: (1) Hierarchical classification has higher classification efficiency: in the testing phase, a hierarchical classifier only needs to pass through a few node classifiers along one root-to-leaf path, whereas a flat classifier must evaluate all categories [6, 37]. (2) Hierarchical classification can effectively deal with imbalanced data. (3) The structural characteristics of a hierarchical classifier make it possible to obtain higher classification accuracy when dealing with structured data. For hierarchical learning, one open problem is how to build a reasonable hierarchical structure that characterizes the inter-relations between categories.

Fig. 1.

The framework of hierarchical classifier training. The blue elliptical nodes represent the root nodes; the green circular nodes represent the intermediate nodes; the violet hexagonal nodes represent the leaf nodes. Before each clustering step, the HCVI is used to select the optimal number of clusters. This process is applied recursively until the leaf nodes are reached, and the hierarchical classifier is then trained over the visual tree from top to bottom. (Color figure online)

In general, the existing approaches for building a hierarchical structure can be roughly divided into three types: (1) Semantic tree [12, 22, 33]. It builds a hierarchical structure by leveraging the semantic ontology of the real world. However, it cannot characterize the inter-relations between categories in the feature space. (2) Label tree [9, 23]. To learn a label tree, we need to first train flat one-versus-rest (OVR) binary classifiers and then utilize their classification results to build the tree. However, the label tree structure often suffers from data imbalance and low training efficiency. (3) Visual tree [13, 30, 41]. In general visual tree learning, a large number of categories are organized hierarchically in a coarse-to-fine fashion via hierarchical clustering. Because the feature space is the common space for classifier training and classification, the visual tree provides a good environment for characterizing the inter-relations between categories. However, the number of cluster centers profoundly influences the structure of the visual tree. Thus, how to determine the number of clusters is a critical issue.

Therefore, a suitable number of clusters at each level of hierarchical clustering is the key to building a reasonable visual tree and training a more discriminative classifier. It is necessary to find a way to effectively evaluate the goodness of a clustering in order to select the suitable number of clusters. It is worth noting that cluster validity indices (CVIs) are often used to evaluate the success of clustering applications [24, 25]. Cluster validity indices can be roughly divided into two categories: external and internal cluster validity indices. The main difference is whether external information is used in the validity measure. Usually, the external information refers to the category labels. For a visual tree, the objects of clustering are categories instead of samples, so no external information is available for the visual tree structure. Therefore, only internal cluster validity indices can be used to guide visual tree building.

Based on these observations, in this paper a hierarchical cluster validity index (HCVI) is developed to support visual tree learning. The HCVI considers both the clustering result at each level and the structural rationality of the visual tree. In hierarchical clustering, we measure the impact of different numbers of clusters on visual tree building and select the most suitable number of clusters before the clustering of each level begins. Based on the visual tree, a hierarchical classifier can be trained from top to bottom. Figure 1 illustrates the framework of hierarchical classifier training; a code sketch of the recursive procedure is given below.
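To make the recursion in Fig. 1 concrete, the following Python sketch builds a visual tree with K-means. It is our illustration, not the paper's code: it assumes one representative feature vector per category, and `select_num_clusters` stands in for the HCVI-guided choice of the cluster number developed in Sect. 3 (a plain silhouette criterion is used here purely as a placeholder).

```python
# A minimal sketch of the recursive visual-tree building in Fig. 1.
# Assumption: each category is described by one feature vector; the HCVI-guided
# choice of q (Sect. 3) is abstracted behind select_num_clusters, for which a
# plain silhouette criterion is used here only as a placeholder.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def select_num_clusters(features, candidates):
    """Placeholder for the HCVI-guided selection of the number of clusters."""
    best_q, best_score = None, -np.inf
    for q in candidates:
        labels = KMeans(n_clusters=q, n_init=10, random_state=0).fit_predict(features)
        score = silhouette_score(features, labels)
        if score > best_score:
            best_q, best_score = q, score
    return best_q

def build_visual_tree(categories, features, max_children=8):
    """categories: list of category ids; features: (len(categories), d) array."""
    node = {"categories": list(categories), "children": []}
    if len(categories) <= 2:                      # leaf: too few categories to split
        return node
    candidates = range(2, min(max_children, len(categories) - 1) + 1)
    q = select_num_clusters(features, candidates)
    labels = KMeans(n_clusters=q, n_init=10, random_state=0).fit_predict(features)
    for k in range(q):                            # recurse into every child cluster
        idx = np.where(labels == k)[0]
        node["children"].append(
            build_visual_tree([categories[i] for i in idx], features[idx]))
    return node
```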

This paper is organized as follows. In Sect. 2, we review some relevant work. In Sect. 3, we present the proposed HCVI algorithm for visual tree learning. In Sect. 4, we present our experiments for algorithm evaluation. Section 5 provides some conclusions.

2 Related Work

The existing approaches for building a hierarchical structure can be divided into three groups: (a) semantic tree; (b) label tree; (c) visual tree. Some researchers utilize the semantic ontology to organize large numbers of categories hierarchically [8, 12, 22, 26, 27, 33]. Marszalek et al. employ the relations between nouns in WordNet to build a semantic tree for visual recognition [26]. Li et al. utilize both image and tag information to discover a semantic image hierarchy, and then employ this hierarchy to encode the inter-category relations [22]. Fan et al. integrate semantic ontology and multi-task learning to perform multi-level image annotation [12]. Other researchers build label tree structures in the feature space [1, 9, 15, 29, 36]. Bengio et al. propose a label embedding tree for multi-class tasks [1]. Griffin et al. automatically generate useful taxonomies for learning hierarchical relationships between categories [15]. However, the label tree structure often suffers from data imbalance and low training efficiency. Therefore, other researchers learn the visual tree by hierarchical clustering directly [28, 30, 40, 41]. Zheng et al. utilize hierarchical affinity propagation clustering and active learning to build the visual tree [40]. Nister et al. build a vocabulary tree by employing hierarchical clustering [28].

Cluster validity indices can be roughly divided into two categories: external and internal cluster validity indices. An external cluster validity index evaluates the quality of a clustering by employing a ground-truth partition [21, 24, 25]. Many external cluster validity indices have been proposed, such as the Rand Index (RI) [31], the Adjusted Rand Index (ARI) [17], the Fowlkes and Mallows index (FM) [14] and the Jaccard Index (JI) [18]. However, in visual tree learning no ground-truth information is available, so internal cluster validity indices should be used. Internal cluster validity indices have been widely used to select the number of clusters. Calinski et al. proposed the Calinski-Harabasz index (CH), which is based on the ratio of the between-cluster to the within-cluster sum of squares [4, 24]. Davies et al. proposed the Davies-Bouldin index (DB), which is based on the ratio of within-cluster scatter to between-cluster separation [7]. Rousseeuw proposed the Silhouette index (Si), which evaluates the consistency of samples within clusters [32]. Tibshirani et al. focused on well-separated clusters and developed the Gap index (Gap) [34]. Dunn proposed the Dunn index (Dunn), which is defined as the ratio of the inter-cluster separation to the intra-cluster compactness [11]. Hartigan proposed the Hartigan index (Har) [5, 16].

3 Hierarchical Cluster Validity Index for Visual Tree Learning

In general, both external and internal cluster validity indices are used to evaluate the performance of a clustering. If we want to use a CVI to guide visual tree learning, the most direct method is to find a reasonable CVI and use it to select a suitable number of clusters before the clustering of each level starts. This approach is appropriate for hierarchical clustering alone. However, although the visual tree is built by hierarchical clustering, its purpose is not to obtain a good clustering result, but to train a discriminative hierarchical classifier on top of it. No matter which CVI is used, it can only select the optimal number of clusters for a single clustering, whereas one hierarchical clustering contains many sub-clusterings. As the saying goes, locally greedy choices do not guarantee a global optimum, so a satisfactory visual tree structure cannot be obtained through traditional internal CVI guidance alone. For example, if the CVI favors fewer clusters at each level, the resulting visual tree will be deep and narrow, and more node classifiers will be stacked along each path of the hierarchical classifier. Unfortunately, the more node classifiers a sample passes through, the more errors can accumulate and the lower the classification accuracy tends to be.

Based on this understanding, we propose a hierarchical cluster validity index that can measure clustering validity while taking care of visual tree learning. The vast majority of CVIs are designed around two key criteria: compactness and separation. Compactness measures the distances between the cluster center and the samples in one cluster; separation measures the pairwise distances between cluster centers. The existing methods already handle these two criteria well. Therefore, our hierarchical cluster validity index (HCVI) mainly focuses on visual tree learning. Specifically, we design a parameter based on the clustering result to measure whether the current clustering is suitable for building a visual tree. After that, we combine this parameter with common CVIs to construct the HCVI and employ the HCVI to guide visual tree building.

Fig. 2.

The overly imbalanced structures of the visual tree. In sub-figure (a), most categories are grouped into one cluster, which leads to category imbalance. In sub-figure (b), although the categories are relatively balanced, the huge difference in the number of samples across categories leads to data imbalance.

In the real world, large numbers of categories are usually distributed unevenly in the feature space (e.g., some of them have strong inter-category similarities, while others have weaker ones). Therefore, hierarchical clustering also generates an imbalanced visual tree. However, an overly imbalanced structure has negative effects on the training of hierarchical classifiers. Figure 2(a) illustrates an overly imbalanced structure. In this figure, each circle represents one category, and one can observe that most categories are grouped into one cluster. This leads to imbalanced data problems when training hierarchical classifiers over such a visual tree. In order to solve this problem, we develop a parameter to evaluate the category balance, which is defined as:

$$\begin{aligned} \sum \limits _{k = 1}^q {[{{(\frac{{{r_k} - {r_E}}}{{{r_E}}})}^2} + 1]} \end{aligned}$$
(1)

where q denotes the number of clusters, \(r_E\) denotes the average number of categories per cluster, and \(r_k\) denotes the number of categories contained in the k-th cluster.
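As a small illustration (ours, not part of the original paper), Eq. (1) can be transcribed directly, assuming a vector r of per-cluster category counts:

```python
import numpy as np

def category_balance(r):
    """Eq. (1): r[k] is the number of categories grouped into cluster k."""
    r = np.asarray(r, dtype=float)
    r_E = r.mean()                               # average number of categories per cluster
    return float(np.sum(((r - r_E) / r_E) ** 2 + 1))
```

For instance, `category_balance([8, 1, 1])` is noticeably larger than `category_balance([4, 3, 3])`, reflecting the kind of imbalance shown in Fig. 2(a).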

This parameter indicates the category balance of the clustering: the larger the value, the more imbalanced the clustering is. On the other hand, in visual tree learning the clustering objects are categories, but when training hierarchical classifiers over the visual tree every sample is used for training. Therefore, we also need to consider the sample balance. Figure 2(b) illustrates an overly imbalanced sample distribution. One can observe that the number of categories in each cluster is almost equal, but the number of samples per category varies greatly, which can seriously affect the training of hierarchical classifiers. Therefore, we develop another parameter to evaluate the sample balance, which is defined as:

$$\begin{aligned} \sum \limits _{k = 1}^q {[{{(\frac{{{m_k} - {m_E}}}{{{m_E}}})}^2} + 1]} \end{aligned}$$
(2)

where \(m_E\) denotes the average number of samples per cluster and \(m_k\) denotes the number of samples contained in the k-th cluster.

In order to measure the category and sample balance simultaneously, we combine these two parameters into a single balance parameter, which is defined as:

$$\begin{aligned} \delta (q) = \frac{1}{q}\sum \limits _{k = 1}^q {({{(\frac{{{r_k} - {r_E}}}{{{r_E}}})}^2} + 1)} ({(\frac{{{m_k} - {m_E}}}{{{m_E}}})^2} + 1) \end{aligned}$$
(3)

This parameter measures the balance of the visual tree learning: the smaller the value, the better the balance. We employ the balance parameter in combination with common CVIs as the HCVI to measure the clustering quality, so that a suitable number of clusters for hierarchical clustering can be selected. For CVIs where larger values are better, such as CH [4], we define the HCVI as \(CH/\delta (q)\); for CVIs where smaller values are better, such as DB [7], we define the HCVI as \(DB \cdot \delta (q)\).
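The following sketch (our illustration, not the paper's code) computes \(\delta (q)\) from Eq. (3) and combines it with two off-the-shelf CVI implementations; `calinski_harabasz_score` and `davies_bouldin_score` from scikit-learn are used as stand-ins for the CH and DB formulas of Table 1:

```python
import numpy as np
from sklearn.metrics import calinski_harabasz_score, davies_bouldin_score

def delta(r, m):
    """Eq. (3): r[k] categories and m[k] samples fall into cluster k."""
    r, m = np.asarray(r, dtype=float), np.asarray(m, dtype=float)
    r_term = ((r - r.mean()) / r.mean()) ** 2 + 1
    m_term = ((m - m.mean()) / m.mean()) ** 2 + 1
    return float(np.mean(r_term * m_term))       # the mean supplies the 1/q factor

def hcvi_ch(X, labels, r, m):
    # CH is larger-is-better, so the HCVI divides it by the balance parameter.
    return calinski_harabasz_score(X, labels) / delta(r, m)

def hcvi_db(X, labels, r, m):
    # DB is smaller-is-better, so the HCVI multiplies it by the balance parameter.
    return davies_bouldin_score(X, labels) * delta(r, m)
```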

4 Experimental Results

4.1 Notation and Definitions

In this section, we introduce the notation used in the experiments [5], and then provide the definitions of the internal CVIs and their hierarchical counterparts (HCVIs): CH [4], DB [7], Si [32], Dunn [11] and Har [16].

In the following, we denote:

  • \(n = \) number of samples;

  • \(p = \) number of variables;

  • \(q = \) number of clusters;

  • \(X = \left\{ {{x_{ij}}} \right\} ,\ i = 1,\ldots ,n,\ j = 1,\ldots ,p \);

  • \(\overline{x} = \) centroid of data matrix X;

  • \(C_k = \) the k-th cluster;

  • \(n_k = \) number of objects in cluster \(C_k\);

  • \(c_k\) = centroid of cluster \(C_k\);

  • \(d(x,y) = \) distance between x and y;

  • \(x_i\) = the p-dimensional feature vector of the i-th object in cluster \(C_k\);

  • \(\left\| x \right\| = {({x^\mathrm{T}}x)^{1/2}}\);

  • \({W_q} = \sum \limits _{k = 1}^q {\sum \limits _{i \in {C_k}} {({x_i} - {c_k}){{({x_i} - {c_k})}^\mathrm{T}}} }\) is the within-class dispersion matrix;

  • \({B_q} = \sum \limits _{k = 1}^q {{n_k}({c_k} - \overline{x} ){{({c_k} - \overline{x} )}^\mathrm{T}}}\) is the between-class dispersion matrix;

  • \({N_t} = n(n - 1)/2\);

  • \({N_w} = \sum \limits _{k = 1}^q {{n_k}({n_k} - 1)/2}\);

  • \({N_b} = {N_t} - {N_w}\);

  • \({S_w} = \sum \limits _{k = 1}^q {\sum \limits _{i,j \in {C_k},i < j} {d({x_i},{x_j})} }\) is the sum of the within-cluster distances;

  • \({S_b} = \sum \limits _{k = 1}^{q - 1} {\sum \limits _{l = k + 1}^q {\sum \limits _{i \in {C_k},j \in {C_l}} {d({x_i},{x_j})} } }\) is the sum of the between-cluster distances.

Table 1. Definitions of cluster validity indices.

Based on these notations, Table 1 shows five widely used internal cluster validity indices and their corresponding hierarchical cluster validity indices. The "Method" column gives the full name of each index, the "Notation" column gives its abbreviation, the "CVI Definition" column gives its computation formula, and the "HCVI Definition" column gives the corresponding hierarchical form.
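For readers who prefer code to formulas, the following sketch computes CH and DB from the notation above using their commonly cited definitions [4, 7]; it is our illustration, not a transcription of Table 1:

```python
import numpy as np

def ch_index(X, labels):
    """Calinski-Harabasz: trace(B_q)/(q-1) divided by trace(W_q)/(n-q) [4]."""
    n, q = len(X), len(np.unique(labels))
    x_bar = X.mean(axis=0)
    trace_b = trace_w = 0.0
    for k in np.unique(labels):
        Xk = X[labels == k]
        ck = Xk.mean(axis=0)
        trace_b += len(Xk) * np.sum((ck - x_bar) ** 2)   # between-cluster dispersion
        trace_w += np.sum((Xk - ck) ** 2)                # within-cluster dispersion
    return (trace_b / (q - 1)) / (trace_w / (n - q))

def db_index(X, labels):
    """Davies-Bouldin: mean over clusters of max_(l!=k) (s_k + s_l) / d(c_k, c_l) [7]."""
    ks = np.unique(labels)
    cents = np.array([X[labels == k].mean(axis=0) for k in ks])
    scat = np.array([np.mean(np.linalg.norm(X[labels == k] - cents[i], axis=1))
                     for i, k in enumerate(ks)])
    ratios = []
    for i in range(len(ks)):
        r = [(scat[i] + scat[j]) / np.linalg.norm(cents[i] - cents[j])
             for j in range(len(ks)) if j != i]
        ratios.append(max(r))
    return float(np.mean(ratios))
```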

4.2 Experimental Settings

In order to verify the effectiveness of the proposed balance parameter, we compare the common CVIs and the balance-parameter-based HCVIs through experiments. We employ K-means as the clustering algorithm. All experiments are carried out in Matlab 2015a. In our experiments, DB, CH and Si are implemented with the Statistics and Machine Learning Toolbox of Matlab, and the Har index is implemented by employing part of the CVAP toolbox [35]. Our experimental environment is a single machine with 4 cores and 16 GB of memory.

4.3 Experiment for Balance Parameter

In this experiment, we evaluate the proposed approach on the Fisheriris data set, which is available at the UCI machine learning repository [2]. It has 150 samples, 50 in each of three categories, and the original data is 4-dimensional. To facilitate visualization, we use the first two dimensions of each sample as one sample and the last two dimensions as another, so there are 300 two-dimensional samples in total. In the experiment, we use 5 common CVIs and their corresponding HCVIs to evaluate the clustering results with different numbers of clusters. It is worth noting that each clustering object in this experiment is a single sample rather than a category, so the HCVIs can only evaluate the sample balance.
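A short sketch of this data reconstruction and evaluation loop (our illustration; scikit-learn's bundled copy of the iris data is used as a stand-in for the UCI download, and only the sample-balance part of \(\delta (q)\) is applied, as discussed above):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score

iris = load_iris().data                              # 150 x 4 Fisheriris samples
X = np.vstack([iris[:, :2], iris[:, 2:]])            # first two and last two dims as separate samples: 300 x 2

for q in range(2, 9):                                # evaluate different numbers of clusters
    labels = KMeans(n_clusters=q, n_init=10, random_state=0).fit_predict(X)
    m = np.bincount(labels).astype(float)            # samples per cluster
    balance = float(np.mean(((m - m.mean()) / m.mean()) ** 2 + 1))   # sample-only balance term
    db = davies_bouldin_score(X, labels)
    print(q, db, db * balance)                       # DB versus its hierarchical form
```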

Table 2. Criterion values of CVIs.
Table 3. Balance parameter of Fisheriris data set.

Table 2 shows the criterion values of the CVIs. The bold values are the optimal criterion values of the different CVIs. One can observe that the optimal number of clusters derived from different CVIs is not identical, even though the clustering data are the same; this shows that the criteria of different CVIs vary widely. Since we have reconstructed the data set, the original labels are no longer valid, so we cannot evaluate which CVI is better. However, our main purpose is not to find the optimal index, but to verify the effectiveness of the balance parameter. Table 3 shows the balance parameter for different numbers of clusters. From the results, it is obvious that the clustering is most balanced when the number of clusters is q = 5. From Tables 2 and 3, we can observe that most common CVIs do not pay attention to the balance of the clustering, which is precisely the concern when building a visual tree. Therefore, the HCVI is a reasonable choice for considering both clustering goodness and balance. Figure 3 illustrates the cluster assignments and the criterion values of the CVIs and their corresponding HCVIs. The first two columns show the results of the CVIs, and the last two columns show the results of the corresponding HCVIs. We can observe that the common CVIs tend to choose fewer clusters, while the HCVIs tend to choose more clusters. In particular, the DB and Si indices both consider \(q=2\) the optimal number of clusters; however, this results in a very imbalanced clustering. After applying the balance parameter, the HCVIs of DB and Si select a more reasonable number of clusters that makes the clustering results more balanced. It is worth noting that the balance parameter does not improve the balance of the Har index, which shows that the Har index hardly considers clustering balance as a criterion. In summary, we can conclude that the proposed balance parameter can effectively improve the performance of CVIs in terms of clustering balance.

Fig. 3.

The cluster assignments and the criterion values of CVIs and their corresponding HCVIs.

4.4 Experiment for Hierarchical Classification

In this section, we evaluate the proposed HCVIs by comparing the classification accuracy of hierarchical classifiers trained on different visual tree structures. In the experiment, we employ the proposed HCVI to build visual trees and train hierarchical classifiers based on these visual trees. Our experiments are carried out on two data sets: CIFAR-100 [20] and ILSVRC-2012 [8]. CIFAR-100 has 100 image categories and each category contains 600 images; we randomly select 10,000 images, half for training and half for testing. The ILSVRC-2012 data set is a subset of ImageNet. It contains 1,000 image categories and each category has over 1,000 images; we randomly select 20,000 images, half for training and half for testing. In the experiment, we employ DeCAF features as the image representation [10], and then use PCA to reduce the dimensionality of the DeCAF features from 4096 to 128.
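Since the clustering objects are categories (Sect. 1), the visual tree is built from one representative vector per category. The following is a hedged sketch of this preprocessing step, under our own assumption (not stated in the paper) that each category is represented by the mean of its PCA-reduced DeCAF features:

```python
import numpy as np
from sklearn.decomposition import PCA

def category_features(decaf, y, dim=128):
    """decaf: (num_images, 4096) DeCAF features; y: (num_images,) category labels.
    Both arrays are assumed to be precomputed and loaded elsewhere."""
    reduced = PCA(n_components=dim).fit_transform(decaf)   # 4096 -> 128, as in the experiment
    cats = np.unique(y)
    # Assumption: one mean feature vector per category serves as the clustering object.
    return cats, np.array([reduced[y == c].mean(axis=0) for c in cats])
```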

Table 4. Classification results on CIFAR-100 and ILSVRC-2012 image set.

In this experiment, we compare the proposed HCVI-based visual tree structure with two types of tree structures: CVI-based visual tree structures and traditional hierarchical structures. In particular, the traditional hierarchical structures include the semantic tree [27], the label tree [15], the visual tree [40] and EVT [40]. We train hierarchical classifiers based on these tree structures and compare their classification results. We employ K-means as the clustering algorithm and SVMs as the node classifiers. The Mean Accuracy (%) is used as the criterion to evaluate the performance of all approaches. The experimental results are shown in Table 4. We can observe that the hierarchical classifiers based on visual trees that utilize cluster validity indices achieve better results. The reason is that the cluster validity indices allow us to obtain better clustering results, and therefore more discriminative visual trees. In addition, most of the HCVI-based methods achieve better results, which illustrates the effectiveness of the proposed balance parameter. It is worth noting that the CH-based method achieves higher classification accuracy than the HCH-based method. One possible reason is that the HCH index weights balance too heavily and under-weights the compactness and separation of the clustering. In general, we can obtain more reasonable visual tree structures through the guidance of HCVIs, which helps train more discriminative hierarchical classifiers.
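For completeness, here is a sketch of how node classifiers can be attached to the visual tree and used top-down at test time. It follows the tree format of the sketch in Sect. 1, uses scikit-learn's LinearSVC as a stand-in for the SVM node classifiers, and omits the final per-leaf classification step, so it illustrates the general scheme rather than the paper's exact implementation:

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_node_classifiers(node, X, y):
    """Recursively attach a multi-class SVM to every internal node of the tree.
    X, y are the training samples and their category labels."""
    if not node["children"]:
        return
    # Each training sample is relabelled by the child cluster its category falls into.
    cat_to_child = {c: i for i, ch in enumerate(node["children"]) for c in ch["categories"]}
    mask = np.isin(y, node["categories"])
    branch = np.array([cat_to_child[c] for c in y[mask]])
    node["clf"] = LinearSVC().fit(X[mask], branch)
    for child in node["children"]:
        train_node_classifiers(child, X, y)

def predict_leaf(node, x):
    """Route a single sample from the root to a leaf and return the leaf's categories."""
    while node["children"]:
        branch = int(node["clf"].predict(x.reshape(1, -1))[0])
        node = node["children"][branch]
    return node["categories"]
```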

5 Conclusion

In this paper, a hierarchical cluster validity index (HCVI) is developed to achieve a more discriminative solution for visual tree learning, where hierarchical classifiers can be trained over the resulting visual trees. Our HCVI integrates the proposed balance parameter with common CVIs, so that both the balance of the visual tree and the effectiveness of the clustering are leveraged to learn a more discriminative hierarchical structure. Therefore, the hierarchical classifier can achieve better results. The experimental results have demonstrated that our hierarchical cluster validity index has superior performance compared with other cluster validity indices in terms of both clustering balance and classification accuracy.