Elsevier

Information Sciences

Volume 581, December 2021, Pages 536-552
Hierarchical classification of data with long-tailed distributions via global and local granulation

https://doi.org/10.1016/j.ins.2021.09.059

Abstract

Automated learning from datasets with a long-tailed distribution has gradually become a research hotspot due to the increasing complexity of large-scale real-world datasets. Existing solutions to long-tailed data classification usually involve re-balancing strategies for global optimization, which can achieve satisfactory results. However, re-balancing strategies tend to alter the original data. In this paper, we propose a knowledge granulation method based on global and local granulation to assist the hierarchical classification of long-tailed data without altering the original data. Firstly, a global classifier is constructed based on the WordNet knowledge organization’s hierarchical structure, which is used to granulate the global data from coarse to fine. Secondly, a local hierarchical classifier adapted to tail data is constructed for tail classes that contain few samples. The hierarchical structure of this local classifier is obtained by granulating the data via spectral clustering rather than by using the semantic hierarchy of classes. Finally, the global classifier is used to preliminarily classify samples, then uncertain samples are further classified by the tail local classifier. Experimental results show that the proposed method outperforms several state-of-the-art models designed for the hierarchical classification of long-tailed data.

Introduction

Classification algorithms often perform poorly on datasets with a long-tailed distribution, which poses challenges for data mining and machine learning applications. In a long-tailed distribution, a small proportion of classes accounts for the majority of the data, while most of the remaining classes lack enough data to be representative [2]. In 2006, Anderson [1] proposed using long-tail theory to describe the business and economic models of websites. More recently, long-tailed distribution learning models have been widely applied in various research fields, such as object detection [47], project recommendation [17], [38], and image classification [37].

Long-tailed distribution learning is a particular classification task in machine learning and has been widely studied [15], [18], [39]. For instance, Yang et al. [42] proposed a scalable algorithm based on image retrieval and superpixel matching for scene analysis, which employs tail classes to achieve a semantic understanding of visual scenes. In document processing, Xie et al. [41] proposed a new model that diversifies the hidden units so that they cover the themes of all regions in the long-tailed distribution. This addresses the redundancy and lack of informativeness that arise when hidden units in the tail region are ignored. Similarly, Reinanda et al. [34] presented a document filtering method for long-tailed data with independent samples and extended it to rarely seen samples. In visual recognition, Cui et al. [2] introduced a theoretical framework that associates each sample with adjacent regions for resampling, and combined it with a re-weighting scheme to solve long-tailed problems.

Resampling [27], [28] and re-weighting are two important re-balancing strategies that are often used to address class imbalance in long-tailed data. On the one hand, resampling directly adjusts the number of samples by over-sampling [9] the minority class [20] or under-sampling the majority class [16]. For example, Kim et al. [22] explored a novel method that augments the tail classes by translating samples from the head classes. This enables the classifier to use the diversity of information to learn comprehensive features, which significantly improves generalization on the minority classes. Tian et al. [36] proposed a semi-supervised resampling method that improves imbalanced classification performance by mixing labeled data with pseudo-labeled unlabeled data. On the other hand, re-weighting adjusts the relative influence of different samples by changing the weights they receive during learning [19], [30]. For instance, Khan et al. [21] proposed a cost-sensitive deep neural network that can automatically learn robust representations of both majority and minority classes. Li et al. [26] used two unsupervised strategies to deal with data imbalance by re-weighting and by balancing the composition of training batches.
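The two re-balancing strategies described above can be illustrated with a minimal sketch. It is not taken from any of the cited works: the function names, the inverse-frequency weighting formula, and the toy "head"/"tail" labels are assumptions chosen only to show what random over-sampling and re-weighting do to an imbalanced label set.

```python
from collections import Counter
import random


def inverse_frequency_weights(labels):
    """Re-weighting (assumed formula): weight = n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


def random_oversample(samples, labels, seed=0):
    """Resampling: duplicate minority-class samples until every class
    has as many samples as the largest class."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    target = max(len(xs) for xs in by_class.values())
    balanced = []
    for y, xs in by_class.items():
        # Draw the missing samples with replacement from the same class.
        extra = [rng.choice(xs) for _ in range(target - len(xs))]
        balanced.extend((x, y) for x in xs + extra)
    return balanced


# Toy long-tailed label set: 8 head samples, 2 tail samples (hypothetical data).
labels = ["head"] * 8 + ["tail"] * 2
samples = list(range(10))

weights = inverse_frequency_weights(labels)
balanced = random_oversample(samples, labels)
```

In this toy case the tail class receives a weight four times larger than the head class, and over-sampling grows the tail class from 2 to 8 samples by duplication, so both strategies change the class frequencies seen during training.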

The approaches mentioned above use re-balancing strategies to tackle the challenges of long-tailed classification, and have achieved effective results. However, over-sampling carries the risk of overfitting the tail data, while under-sampling carries the risk of under-fitting the entire data distribution. Re-weighting distorts the original distribution by directly changing or even inverting the frequencies presented by the data. Fortunately, the relationship among classes can be visualized as a hierarchical structure, which is auxiliary knowledge that can assist classification [14]. Hierarchical classification research has made rapid advances during recent years [29]. This research has been successfully applied in different areas, such as functional genomics [32], text classification [35], image classification [37], and network management [8]. Therefore, we can adopt the hierarchical structure as external knowledge to assist the classification of long-tailed data without changing the data itself.

In this paper, we propose a hierarchical classification method based on global and local granulation strategies (HCGLG) to investigate the effect of hierarchical structures on long-tailed learning. Firstly, global granulation is used to construct classifiers from coarse to fine according to the hierarchical structure of the WordNet knowledge organization. Using this hierarchical structure, a large classification task can be divided into several sub-classification tasks from coarse-grained to fine-grained. Secondly, spectral clustering granulation from fine to coarse is used to construct a local classifier for tail classes based on the similarity between the features of tail class samples. Finally, the global classifier preliminarily classifies each test sample, and its prediction is accepted when the output probability is higher than the threshold. When the probability is less than the threshold, the tail local classifier is used for further classification.
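The final routing step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `route_prediction`, the stand-in classifiers, the class names, and the threshold value 0.5 are all hypothetical, and a probability exactly equal to the threshold is assigned to the global classifier, a case the text leaves unspecified.

```python
def route_prediction(sample, global_classifier, tail_classifier, threshold=0.5):
    """Accept the global prediction when it is confident enough,
    otherwise defer to the tail local classifier."""
    probs = global_classifier(sample)          # dict: class label -> probability
    label, confidence = max(probs.items(), key=lambda kv: kv[1])
    if confidence >= threshold:                # tie handling is an assumption
        return label, "global"
    return tail_classifier(sample), "tail-local"


# Stand-in classifiers (hypothetical): a real system would use trained models.
def toy_global_classifier(sample):
    if sample == "clear":
        return {"dog": 0.9, "cat": 0.1}
    return {"dog": 0.4, "cat": 0.35, "lynx": 0.25}


def toy_tail_classifier(sample):
    return "lynx"


confident = route_prediction("clear", toy_global_classifier, toy_tail_classifier)
uncertain = route_prediction("ambiguous", toy_global_classifier, toy_tail_classifier)
```

Here the first sample is kept by the global classifier because its top probability (0.9) exceeds the threshold, while the second (top probability 0.4) is passed to the tail local classifier. The training data is never resampled or re-weighted in this scheme; only the decision path changes.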

Experimental results on seven benchmark and two synthetic long-tailed datasets show that our model clearly outperforms existing state-of-the-art hierarchical classification methods. For instance, the HCGLG method is about 2% better than traditional hierarchical classification methods on all but one of the tested datasets. Extensive experiments verify the effectiveness of our proposed method for long-tailed data learning.

The main contributions of our manuscript can be summarized as follows: (1) We propose a hierarchical classification model to handle the long-tailed problem, which can process the head class and tail class within a single unified framework. Unlike traditional classification methods, which assume that classes are independent, the proposed model makes full use of the hierarchical structure of classes as auxiliary classification knowledge. (2) A local hierarchical classifier of the tail class is used for further classification when the uncertainty of the global classifier is greater than the threshold. Unlike traditional long-tailed distribution learning, the proposed model makes full use of the characteristics of the head and tail data and does not alter the data distribution. (3) We evaluate our method on several long-tailed image and protein datasets, finding it to consistently achieve superior performance over existing approaches.

The remainder of this paper is organized as follows. In Section 2, we introduce the details of our proposed approach. We present the experimental settings used to test our approach in Section 3. The experimental results and analysis are summarized in Section 4, then the paper concludes in Section 5.

Section snippets

Long-tailed data hierarchical classification via global and local granulation

In this section, we introduce the main framework of the long-tailed data hierarchical classification via global and local granulation (HCGLG) model.

Experimental settings

In this section, we first introduce seven real datasets and two long-tailed datasets synthesized from the CIFAR100 dataset. We then introduce six comparative methods and three evaluation measures. All experiments were performed on a Windows 10 desktop computer with 24 GB of memory and a 3.40 GHz Intel Core i7-3770 CPU. In addition, code and instructions related to the HCGLG algorithm have been uploaded to GitHub and can be accessed via: https://github.com/fhqxa.

Experimental results and analysis

In this section, we first discuss the parameter effects of different datasets when used with the proposed method. Then, the classification of tail classes is discussed to verify the effectiveness of the HCGLG. Furthermore, its efficacy and efficiency are compared with other methods. We utilize 10-fold cross-validation for all classification methods to obtain experimental results for comparison. Specifically, 90% of the dataset is used in the model training process and 10% is used to verify the

Conclusions and future work

This paper proposed a hierarchical classification method for long-tailed data based on global and local granulation strategies. The model employs local classification of the tail data to achieve global classification according to the state of the long-tailed data distribution. The HCGLG model divides the global data from coarse- to fine-grained to perform global and tail classification tasks. A hierarchical structure model was established for the spectral clustering of tail classes containing

CRediT authorship contribution statement

Hong Zhao: Methodology, Software, Validation, Writing – review & editing. Shunxin Guo: Methodology, Data curation, Writing – original draft. Yaojin Lin: Conceptualization, Data curation, Writing – review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

This work was supported by the National Natural Science Foundation of China under Grant Nos. 62076116 and 61703196, and the Natural Science Foundation of Fujian Province under Grant Nos. 2021J011003 and 2021J02049.

References (47)

  • J. Deng, W. Dong, R. Socher, L.J. Li, F.F. Li, ImageNet: A large-scale hierarchical image database, in: IEEE Computer...
  • J. Deng et al., Hedging your bets: Optimizing accuracy-specificity trade-offs in large scale visual recognition, IEEE Conference on Computer Vision and Pattern Recognition (2012)
  • C.H. Ding et al., Multi-class protein fold recognition using support vector machines and neural networks, Bioinformatics (2001)
  • C. Drummond, R. Holte, C4.5, class imbalance, and cost sensitivity: Why under-sampling beats oversampling, in:...
  • O.J. Dunn, Multiple comparisons among means, J. Am. Stat. Assoc. (1961)
  • M. Everingham et al., The pascal visual object classes (voc) challenge, Int. J. Comput. Vision (2010)
  • M. Friedman, A comparison of alternative tests of significance for the problem of m rankings, Ann. Math. Stat. (1940)
  • L. Grimaudo et al., Hierarchical learning for fine grained internet traffic classification, International Wireless Communications and Mobile Computing Conference (2012)
  • H. He, Y. Bai, E.A. Garcia, S. Li, Adasyn: Adaptive synthetic sampling approach for imbalanced learning, in: IEEE...
  • H. He et al., Learning from imbalanced data, IEEE Trans. Knowl. Data Eng. (2009)
  • Y. Ho et al., Who likes it more? mining worth-recommending items from long tails by modeling relative preference
  • G.V. Horn, P. Perona, The devil is in the tails: Fine-grained classification in the wild, Computing Research...
  • C. Huang et al., Learning deep representation for imbalanced classification, IEEE Conference on Computer Vision and Pattern Recognition (2016)