Hierarchical classification of data with long-tailed distributions via global and local granulation
Introduction
Classification algorithms often perform poorly on datasets with a long-tailed distribution, which poses a challenge for data mining and machine learning applications. In a long-tailed distribution, a small proportion of classes accounts for the majority of the data, while most of the other classes lack enough data to be representative [2]. In 2006, Anderson [1] proposed using long-tail theory to describe the business and economic models of websites. More recently, long-tailed distribution learning models have been widely applied in various research fields, such as object detection [47], project recommendation [17], [38], and image classification [37].
Long-tailed distribution learning is a particular classification task in machine learning and has been widely studied [15], [18], [39]. For instance, Yang et al. [42] proposed a scalable algorithm for scene analysis based on image retrieval and superpixel matching, which exploits tail classes to achieve a semantic understanding of visual scenes. In document processing, Xie et al. [41] proposed a new model that diversifies the hidden units so that they cover the themes of all regions of the long-tailed distribution. It addresses the redundancy and lack of information that arise when hidden units in the tail region are ignored. Similarly, Reinanda et al. [34] presented a document filtering method for long-tailed data with independent samples and extended it to rarely seen samples. In visual recognition, Cui et al. [2] introduced a theoretical framework that associates each sample with adjacent regions for resampling, and combined it with a re-weighting scheme to solve long-tailed problems.
Resampling [27], [28] and re-weighting are two important re-balancing strategies that are often used to address class imbalance in long-tailed data. On the one hand, resampling directly adjusts the number of samples by over-sampling [9] the minority class [20] or under-sampling the majority class [16]. For example, Kim et al. [22] explored a novel method that extends the tail classes by converting samples from the head classes. It enables the classifier to use the diversity of information to learn comprehensive features, which significantly improves generalization on the minority classes. Tian et al. [36] proposed a semi-supervised resampling method that improves imbalanced classification performance by mixing labeled data with pseudo-labeled unlabeled data. On the other hand, re-weighting adjusts the relative influence of different samples by changing their weights during learning [19], [30]. For instance, Khan et al. [21] proposed a cost-sensitive deep neural network that can automatically learn robust representations of both majority and minority classes. Li et al. [26] used two unsupervised strategies to deal with data imbalance by re-weighting and by balancing the composition of training batches.
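To make the two re-balancing strategies concrete, the following is a minimal sketch and not the implementation of any of the cited methods. It assumes inverse-frequency class weights as the re-weighting rule and plain random duplication as the over-sampling rule; the function names and the toy data are hypothetical.

```python
import random
from collections import Counter


def inverse_frequency_weights(labels):
    """Re-weighting: weight of class c is n_samples / (n_classes * count(c))."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * m) for c, m in counts.items()}


def random_oversample(samples, labels, seed=0):
    """Resampling: duplicate minority samples until every class matches the largest class."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    target = max(len(xs) for xs in by_class.values())
    out_x, out_y = [], []
    for y, xs in by_class.items():
        # Draw the missing samples with replacement from the same class.
        extra = [rng.choice(xs) for _ in range(target - len(xs))]
        out_x.extend(xs + extra)
        out_y.extend([y] * target)
    return out_x, out_y


# Toy data (assumed): 8 head samples and 2 tail samples.
labels = ["head"] * 8 + ["tail"] * 2
samples = list(range(10))
weights = inverse_frequency_weights(labels)
balanced_x, balanced_y = random_oversample(samples, labels)
```

In this toy setting the tail class receives a larger weight than the head class, and over-sampling only repeats the two existing tail samples, which illustrates why over-sampling can overfit the tail.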
The approaches mentioned above use re-balancing strategies to tackle the challenges of long-tailed classification, and have achieved good results. However, over-sampling carries the risk of overfitting the tail data, while under-sampling carries the risk of under-fitting the entire data distribution. Re-weighting distorts the original distribution by directly changing or even inverting the frequencies presented by the data. Fortunately, the relationships among classes can be represented as a hierarchical structure, which is auxiliary knowledge that can assist classification [14]. Hierarchical classification research has advanced rapidly in recent years [29] and has been successfully applied in different areas, such as functional genomics [32], text classification [35], image classification [37], and network management [8]. Therefore, we can adopt the hierarchical structure as external knowledge to assist the classification of long-tailed data without changing the data itself.
In this paper, we propose a hierarchical classification method based on global and local granulation strategies (HCGLG) to study the effect of hierarchical structures on long-tailed learning. Firstly, global granulation is used to construct classifiers from coarse to fine according to the hierarchical structure of the WordNet knowledge organization. Using the hierarchical structure, a large classification task can be divided into several sub-classification tasks from coarse- to fine-grained. Secondly, spectral clustering granulation from fine to coarse is used to construct a local classifier for the tail classes based on the similarity between the features of tail class samples. Finally, a test sample is first classified by the global classifier, and this prediction is accepted if its output probability is higher than the threshold. When the probability for the test sample is less than the threshold, the tail local classifier is used for further classification.
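The threshold-based hand-off between the global classifier and the tail local classifier described above can be sketched as follows. This is an illustrative sketch only: the function name, the dictionary-based probability format, the default threshold of 0.5, and the handling of a probability exactly equal to the threshold are all assumptions, not details taken from the HCGLG implementation.

```python
def route_prediction(global_probs, local_tail_probs, threshold=0.5):
    """Return (label, source) for one test sample.

    global_probs / local_tail_probs map class labels to probabilities
    (hypothetical format). The global prediction is accepted when its top
    probability reaches the threshold; otherwise the tail classifier decides.
    """
    global_label = max(global_probs, key=global_probs.get)
    # Assumption: a probability equal to the threshold counts as confident.
    if global_probs[global_label] >= threshold:
        return global_label, "global"
    tail_label = max(local_tail_probs, key=local_tail_probs.get)
    return tail_label, "local"


# Confident global prediction: the tail classifier is not consulted.
confident = route_prediction({"cat": 0.9, "dog": 0.1}, {"lynx": 0.7, "wolf": 0.3})

# Uncertain global prediction: fall back to the tail local classifier.
uncertain = route_prediction(
    {"cat": 0.4, "dog": 0.3, "other": 0.3}, {"lynx": 0.7, "wolf": 0.3}
)
```

The design choice here is that the tail classifier is only evaluated for samples on which the global classifier is uncertain, so confident head-class predictions are left unchanged.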
Experimental results on seven benchmark and two synthetic long-tailed datasets show that our model clearly outperforms existing state-of-the-art hierarchical classification methods. For instance, the HCGLG method is about 2% better than traditional hierarchical classification methods on all but one of the tested datasets. Extensive experiments verify the effectiveness of our proposed method for long-tailed data learning.
The main contributions of our manuscript can be summarized as follows: (1) We propose a hierarchical classification model to handle the long-tailed problem, which can process the head class and tail class within a single unified framework. Unlike traditional classification methods, which assume that classes are independent, the proposed model makes full use of the hierarchical structure of classes as auxiliary classification knowledge. (2) A local hierarchical classifier of the tail class is used for further classification when the uncertainty of the global classifier is greater than the threshold. Unlike traditional long-tailed distribution learning, the proposed model makes full use of the characteristics of the head and tail data and does not alter the data distribution. (3) We evaluate our method on several long-tailed image and protein datasets, finding it to consistently achieve superior performance over existing approaches.
The remainder of this paper is organized as follows. In Section 2, we introduce the details of our proposed approach. We present the experimental settings used to test our approach in Section 3. The experimental results and analysis are summarized in Section 4, then the paper concludes in Section 5.
Long-tailed data hierarchical classification via global and local granulation
In this section, we introduce the main framework of the HCGLG model for hierarchical classification of long-tailed data via global and local granulation.
Experimental settings
In this section, we first introduce seven real datasets and two long-tailed datasets synthesized from the CIFAR100 dataset. We then introduce six comparative methods and three evaluation measures. All experiments were performed on a Windows 10 desktop computer with 24 GB of memory and a 3.40 GHz Intel Core i7-3770 CPU. In addition, the code and instructions for the HCGLG algorithm have been uploaded to GitHub and can be accessed via: https://github.com/fhqxa.
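This excerpt does not state how the two long-tailed datasets were synthesized from CIFAR100. As a hedged illustration, the sketch below uses one common convention from the long-tailed learning literature: per-class sample counts that decay exponentially from the largest class to the smallest. The function names, the imbalance ratio, and the toy labels are assumptions and may differ from the authors' procedure.

```python
import random


def long_tailed_counts(n_max, n_classes, imbalance_ratio):
    """Per-class counts decaying exponentially from n_max to n_max / imbalance_ratio."""
    counts = []
    for i in range(n_classes):
        frac = i / (n_classes - 1) if n_classes > 1 else 0.0
        counts.append(max(1, round(n_max * (1.0 / imbalance_ratio) ** frac)))
    return counts


def subsample_indices(labels, counts, seed=0):
    """Pick counts[c] random sample indices for every integer class label c."""
    rng = random.Random(seed)
    by_class = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)
    keep = []
    for y, idxs in by_class.items():
        keep.extend(rng.sample(idxs, min(counts[y], len(idxs))))
    return sorted(keep)


# CIFAR100-like profile (assumed): 100 classes, 500 training images per class,
# and an assumed imbalance ratio of 100 between the largest and smallest class.
profile = long_tailed_counts(n_max=500, n_classes=100, imbalance_ratio=100)

# Toy usage: 3 balanced classes with 10 samples each, reduced to a long tail.
toy_labels = [c for c in range(3) for _ in range(10)]
toy_counts = long_tailed_counts(n_max=10, n_classes=3, imbalance_ratio=10)
toy_keep = subsample_indices(toy_labels, toy_counts)
```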
Experimental results and analysis
In this section, we first discuss the effect of the parameters of the proposed method on different datasets. Then, the classification of tail classes is discussed to verify the effectiveness of HCGLG. Furthermore, its efficacy and efficiency are compared with those of other methods. We use 10-fold cross-validation for all classification methods to obtain experimental results for comparison. Specifically, 90% of the dataset is used in the model training process and 10% is used to verify the
Conclusions and future work
This paper proposed a hierarchical classification method for long-tailed data based on global and local granulation strategies. The model employs local classification of the tail data to achieve global classification according to the state of the long-tailed data distribution. The HCGLG model divides the global data from coarse- to fine-grained to perform global and tail classification tasks. A hierarchical structure model was established for the spectral clustering of tail classes containing
CRediT authorship contribution statement
Hong Zhao: Methodology, Software, Validation, Writing – review & editing. Shunxin Guo: Methodology, Data curation, Writing – original draft. Yaojin Lin: Conceptualization, Data curation, Writing – review & editing.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgements
This work was supported by the National Natural Science Foundation of China under Grant Nos. 62076116 and 61703196, and the Natural Science Foundation of Fujian Province under Grant Nos. 2021J011003 and 2021J02049.
References (47)
- et al. Hierarchical annotation of medical images. Pattern Recogn. (2011)
- et al. Novel feature selection and classification of internet video traffic based on a hierarchical scheme. Comput. Netw. (2017)
- et al. Hierarchical feature selection with multi-granularity clustering structure. Inf. Sci. (2021)
- et al. Robust hierarchical feature selection driven by data and knowledge. Inf. Sci. (2021)
- et al. SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. (1995)
- et al. A novel aco-ga hybrid algorithm for feature selection in protein function prediction. Expert Syst. Appl. (2009)
- et al. Multi-objective optimization for long tail recommendation. Knowl.-Based Syst. (2016)
- The long tail: Why the future of business is selling less of more. Hyperion (2006)
- et al. Class-balanced loss based on effective number of samples. IEEE Conference on Computer Vision and Pattern Recognition (2020)
- Statistical comparisons of classifiers over multiple data sets. J. Mach. Learn. Res. (2006)
- Hedging your bets: Optimizing accuracy-specificity trade-offs in large scale visual recognition. IEEE Conference on Computer Vision and Pattern Recognition
- Multi-class protein fold recognition using support vector machines and neural networks. Bioinformatics
- Multiple comparisons among means. J. Am. Stat. Assoc.
- The pascal visual object classes (voc) challenge. Int. J. Comput. Vision
- A comparison of alternative tests of significance for the problem of m rankings. Ann. Math. Stat.
- Hierarchical learning for fine grained internet traffic classification. International Wireless Communications and Mobile Computing Conference
- Learning from imbalanced data. IEEE Trans. Knowl. Data Eng.
- Who likes it more? mining worth-recommending items from long tails by modeling relative preference
- Learning deep representation for imbalanced classification. IEEE Conference on Computer Vision and Pattern Recognition