Artificial datasets for hierarchical classification

https://doi.org/10.1016/j.eswa.2021.115218

Highlights

  • A novel method for generating hierarchical artificial datasets is proposed.

  • Several hierarchical datasets of tree and directed acyclic graph type are generated.

  • Hierarchical classification methods are evaluated with the artificial datasets.

  • The datasets and the program to generate them are made available to the community.

Abstract

Hierarchical classification (HC) is a special type of multilabel classification, in which an instance can be associated with multiple labels, but in HC the labels are arranged in a predefined structure, commonly a tree or, in its general form, a Directed Acyclic Graph (DAG). HC comprises up to eight different problems, and when a method is proposed to solve one of them, the real-world datasets available for that problem are limited. Thus, a way to extend the evaluation of a method is to generate Artificial Datasets (ADs). ADs are useful to evaluate a method under different conditions that may not be present in the available datasets. Accordingly, this work proposes a method that can generate artificial datasets for up to four of the different hierarchical classification problems, using probability distributions to generate the instances. Furthermore, two groups of ADs, with Tree and DAG hierarchies, were generated with the proposed method and are made available to the scientific community, together with the source code so that users can generate their own datasets. Finally, standard and state-of-the-art methods were evaluated with the generated artificial datasets. The best performance was obtained by two state-of-the-art methods that make use of Bayesian networks and chained classifiers. The proposed method for generating HC datasets provides a flexible and general alternative for evaluating different hierarchical classification methods.

Introduction

The need to evaluate existing and new methods for hierarchical classification is what inspired this work. There are some real-world datasets (Ramírez-Corona et al., 2016, Silla and Freitas, 2011, Barutçuoglu et al., 2008) that have been used to evaluate new methods. However, evaluation over those datasets provides little information about the method itself, such as whether it fails due to the difficulty of the datasets or whether it is affected by the amount of information available. Thus, artificial datasets are an option to evaluate methods under different conditions and to identify where methods fail so that they can be improved.

Synthetic data have been used for different purposes, such as creating synthetic instances (Bowyer, Chawla, Hall, & Kegelmeyer, 2011) and generating artificial datasets (Barutçuoglu et al., 2008, Patki et al., 2016, Melville and Mooney, 2003, Beyan and Fisher, 2015, Cesa-Bianchi et al., 2006). Hence, this paper proposes a method for generating artificial datasets for up to four of the hierarchical classification problems described by Silla and Freitas (2011), in which each instance is associated with a single path of labels (SPL).

The main idea of our method is that, given any hierarchical structure, DAG or Tree, and a probability distribution for each leaf node, an artificial dataset of Full Depth (FD) can be generated by sampling from the distribution of each leaf node. Generating artificial datasets of Partial Depth (PD) requires knowing the distribution of each internal node, which is not provided as input; however, it is known that upper nodes represent general information while lower nodes represent specific information. Thus, the distribution of each internal node is estimated from the distributions of its descendant leaf nodes, and artificial datasets of PD type can then be generated. Additionally, two variants are proposed. The first is balanced leaves, in which all leaf nodes have the same number of instances regardless of the level at which the leaf node is located. The second is unbalanced leaves, where a ratio value for each leaf node is used to generate a different number of instances per leaf.
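The Python sketch below illustrates this sampling idea for the full-depth case, assuming multivariate Gaussian leaf distributions; the hierarchy, the function names and the parameters are invented for illustration and are not taken from the released generator.

# Minimal sketch of the sampling idea (not the authors' released code).
# Assumes each leaf node carries a multivariate Gaussian over the attributes.
import numpy as np

# Hypothetical Tree hierarchy: child -> parent (root has parent None)
parents = {"root": None, "A": "root", "B": "root",
           "A1": "A", "A2": "A", "B1": "B"}
leaves = {"A1", "A2", "B1"}

# Hypothetical leaf distributions: (mean vector, covariance matrix)
rng = np.random.default_rng(0)
leaf_dists = {leaf: (rng.normal(size=2), np.eye(2)) for leaf in leaves}

def path_to_root(node):
    """Return the single path of labels from a node up to the root."""
    path = []
    while node is not None:
        path.append(node)
        node = parents[node]
    return path

def sample_full_depth(n_per_leaf=100):
    """Full-depth (FD) dataset: every instance is labeled with a leaf path."""
    X, Y = [], []
    for leaf, (mean, cov) in leaf_dists.items():
        X.append(rng.multivariate_normal(mean, cov, size=n_per_leaf))
        Y += [path_to_root(leaf)] * n_per_leaf
    return np.vstack(X), Y

X, Y = sample_full_depth()
print(X.shape, Y[0])   # e.g. (300, 2) ['A1', 'A', 'root']

For the partial-depth case, the same loop would also cover internal nodes, whose distributions are estimated from their descendant leaves; for the unbalanced variant, n_per_leaf would be replaced by a per-leaf ratio.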

The proposed method for generating artificial datasets is very flexible with respect to the structure of the hierarchy, and it imposes no limits on the number of nodes or the number of attributes. Therefore, artificial datasets can be generated according to the application of interest.

Furthermore, we have generated artificial datasets1 that are available to the scientific community; the source code is provided at the same link so that users can generate their own datasets. These datasets are divided by hierarchical structure, that is, Tree or DAG. Each group is further divided by level of difficulty, easy and hard, with the Tree type also including a very hard level.

Subsequently, some standard and state-of-the-art methods, most of which predict only a single path, are evaluated to identify which ones perform well on the different datasets. The standard methods Top-Down (TD) and Flat, together with state-of-the-art techniques such as HCP (Serrano-Pérez & Sucar, 2019), HCA (Serrano-Pérez & Sucar, 2019), HCC (Serrano-Pérez & Sucar, 2019), nLLCPN (Nakano, Pinto, Pappa, & Cerri, 2017), LCPNB (Nakano et al., 2017), CPE (Ramírez-Corona et al., 2016) and CLUS-HMC (Vens, Struyf, Schietgat, Dzeroski, & Blockeel, 2008), were evaluated with the proposed artificial datasets. The Friedman test with its post hoc Nemenyi test was then used to compare the performance of the different classifiers. Additionally, we provide precision and recall tables as supplementary material.
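As a rough sketch of this comparison step, the Friedman test can be applied to a matrix of scores of classifiers over datasets; the scores below are invented for illustration, and scikit-posthocs is only one possible implementation of the post hoc Nemenyi test.

# Illustrative classifier comparison with the Friedman test; scores are made up.
import numpy as np
from scipy.stats import friedmanchisquare

# rows = artificial datasets, columns = classifiers (e.g. TD, Flat, HCC)
scores = np.array([[0.71, 0.65, 0.78],
                   [0.69, 0.60, 0.75],
                   [0.80, 0.72, 0.83],
                   [0.66, 0.58, 0.70]])

stat, p = friedmanchisquare(*scores.T)   # one sample of scores per classifier
print(f"Friedman chi2={stat:.3f}, p={p:.4f}")
# If p is small, a post hoc Nemenyi test (e.g. scikit-posthocs'
# posthoc_nemenyi_friedman) can locate which classifiers differ.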

The contributions of this paper are threefold: first, a method that can generate datasets for up to four hierarchical classification problems; second, artificial datasets with Tree and DAG structures and different levels of difficulty, which are made available to the scientific community; and third, a comparison of several methods for hierarchical classification.

The document is organized as follows. Section 2 summarizes the fundamentals of hierarchical classification. Section 3 reviews related work. Section 4 presents the proposed method to build artificial datasets for hierarchical classification. Section 5 describes the artificial datasets generated with the proposed method. Section 6 describes the methods that were evaluated with the artificial datasets and their results. Finally, in Section 7 the conclusions and some ideas for future work are given.

Section snippets

Fundamentals of hierarchical classification

Hierarchical classification is a special type of multilabel classification in which the labels are arranged in a predefined structure; this structure can be a Tree or, in its general form, a Directed Acyclic Graph (DAG). Thus, the Hierarchical Structure (HS) can be denoted with graph notation: HS = (L, E), where L is the set of nodes (labels/classes), E is the set of edges that link the nodes, and HS is a DAG. Note that a Tree is a DAG where all nodes have only one parent, except the root node
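A minimal sketch of this definition, with an invented node set: the hierarchy is a Tree exactly when every non-root node has a single parent, and a DAG otherwise.

# Toy encoding of a hierarchical structure HS = (L, E); node names are illustrative.
L = {"root", "A", "B", "C"}
E = {("root", "A"), ("root", "B"), ("A", "C"), ("B", "C")}  # C has two parents

def is_tree(nodes, edges, root="root"):
    """HS is a Tree iff every non-root node has exactly one parent."""
    parents = {n: [] for n in nodes}
    for parent, child in edges:
        parents[child].append(parent)
    return all(len(parents[n]) == 1 for n in nodes if n != root)

print(is_tree(L, E))  # False: node C has two parents, so this HS is a DAG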

Related work

Synthetic data have been used in different ways, from adding some instances to a class to creating completely new datasets. Some methods that create synthetic data are presented below; however, not all of them generate artificial data for hierarchical classification problems.

Synthetic data have been used to add extra instances to training datasets. Because datasets for hierarchical classification are naturally unbalanced, Feng, Fu, and Zheng (2018) use the sibling policy and then the SMOTE method

A method to build artificial data from a hierarchy

The need to evaluate existing and new methods for hierarchical classification is what inspired this work. As noted in the related work, there are some real-world datasets (Ramírez-Corona et al., 2016, Silla and Freitas, 2011, Nakano et al., 2017) that have been used to evaluate new methods for hierarchical classification. However, evaluation over those datasets provides little information about the method itself, such as whether it fails due to the difficulty of the datasets, or whether it is affected by

Artificial Datasets for Hierarchical Classification

In this section, the artificial datasets3 (AD) for hierarchical classification (HC) are introduced. The ADs are divided into two groups: those with a Tree hierarchical structure and those with a DAG.

The reported run times for generating the datasets, for both training and test sets, were obtained by executing a Python script on a personal computer with an Intel Core i3-2350M, 8 GB of RAM and a 45 GB HDD.

Evaluation Measures

There are several evaluation measures used in hierarchical classification, such as exact-match (Hernandez, Sucar, & Morales, 2013), accuracy (Ramírez-Corona et al., 2016, Secker et al., 2007, Babbar et al., 2013), hamming-accuracy (Ramírez-Corona et al., 2016) and the hierarchical f-measure (Silla and Freitas, 2011, Kiritchenko et al., 2006, Naik and Rangwala, 2017, Naik and Rangwala, 2016). However, due to the large number of artificial datasets proposed in this paper, we will focus our results
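For reference, the hierarchical f-measure mentioned above can be computed per instance from ancestor-augmented label sets, as in the short sketch below; the label sets used are hypothetical.

# Sketch of hierarchical precision/recall/F1 (Kiritchenko et al., 2006) for one instance.
def h_f_measure(pred_with_ancestors, true_with_ancestors):
    """Predicted and true label sets are assumed to include their ancestors."""
    inter = len(pred_with_ancestors & true_with_ancestors)
    hp = inter / len(pred_with_ancestors)
    hr = inter / len(true_with_ancestors)
    f1 = 2 * hp * hr / (hp + hr) if (hp + hr) > 0 else 0.0
    return hp, hr, f1

# Hypothetical single-path example in a hierarchy root -> A -> {A1, A2}
pred = {"A", "A1"}   # predicted path (root excluded by convention)
true = {"A", "A2"}
print(h_f_measure(pred, true))  # (0.5, 0.5, 0.5)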

Conclusions and Future Work

In this paper, a method was proposed to generate artificial datasets for the following hierarchical classification problems: <T, SPL, FD>, <T, SPL, PD>, <DAG, SPL, FD> and <DAG, SPL, PD>. The method makes use of probability distributions; that is, instances are generated from those distributions. The method requires as input the hierarchy and a distribution for each leaf node; it then estimates the distributions for the internal nodes.

Moreover, we generated some artificial datasets with hierarchical

CRediT authorship contribution statement

Jonathan Serrano-Pérez: Conceptualization, Software, Validation, Formal analysis, Writing - original draft, Writing - review & editing. L. Enrique Sucar: Conceptualization, Formal analysis, Resources, Writing - original draft, Writing - review & editing, Supervision.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgement

This work has been partially supported by CONACYT.

References (23)

  • C. Beyan et al. Classifying imbalanced data sets using similarity based hierarchical decomposition. Pattern Recognition (2015)

  • M. Ramírez-Corona et al. Hierarchical multilabel classification based on path evaluation. International Journal of Approximate Reasoning (2016)

  • J.T. Tomás et al. A framework to generate synthetic multi-label datasets. Electronic Notes in Theoretical Computer Science (2014)

  • R. Babbar et al. Maximum-margin framework for training data synchronization in large-scale hierarchical classification

  • Z. Barutçuoglu et al. Bayesian Aggregation for Hierarchical Classification (2008)

  • Bowyer, K.W., Chawla, N.V., Hall, L.O., & Kegelmeyer, W.P. (2011). SMOTE: synthetic minority over-sampling technique....

  • N. Cesa-Bianchi et al. Hierarchical classification: Combining Bayes with SVM

  • J. Demšar. Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research (2006)

  • S. Feng et al. A hierarchical multi-label classification method based on neural networks for gene function prediction. Biotechnology & Biotechnological Equipment (2018)

  • Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., & Bengio, Y. (2014)....

  • J. Hernandez et al. A hybrid global-local approach for hierarchical classification
