Artificial datasets for hierarchical classification
Introduction
The need to evaluate existing and new methods for Hierarchical Classification is what inspired this work. Several real-world datasets (Ramírez-Corona et al., 2016, Silla and Freitas, 2011, Barutçuoglu et al., 2008) have been used to evaluate new methods. However, evaluation on those datasets provides little information about the method itself, such as whether it fails due to the difficulty of the datasets, or whether it is affected by the amount of information available. Thus, artificial datasets are an option for evaluating methods under different conditions and for identifying where methods fail, so that they can be improved.
Synthetic data have been used for different purposes, such as creating synthetic instances (Bowyer, Chawla, Hall, & Kegelmeyer, 2011) and generating artificial datasets (Barutçuoglu et al., 2008, Patki et al., 2016, Melville and Mooney, 2003, Beyan and Fisher, 2015, Cesa-Bianchi et al., 2006). Thus, this paper proposes a method for generating different artificial datasets for up to four of the hierarchical classification problems described by Silla and Freitas (2011), in which each instance is associated with a single path of labels (SPL).
The main idea of our method is that, given any hierarchical structure, DAG or Tree, and a probability distribution for each leaf node, an artificial dataset of Full Depth (FD) can be generated by sampling from the distribution of each node. Generating artificial datasets of Partial Depth (PD) requires knowing the distribution of each internal node, which is not provided as input; but it is known that upper nodes represent general information while lower nodes represent specific information. Thus, the distributions of internal nodes are estimated from the distributions of their descendant leaf nodes, after which artificial datasets of PD type can be generated. Additionally, two variants are proposed. The first is balanced leaves, in which all leaf nodes have the same number of instances regardless of the level at which each leaf node is located. The second is unbalanced leaves, where a ratio value for each leaf node is used to generate a different number of instances.
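The generation idea above can be illustrated with a minimal sketch. This is not the authors' implementation; the two-attribute hierarchy, the per-leaf Gaussian parameters, and the uniform choice among descendant leaves for internal nodes are all illustrative assumptions.

```python
import random

# Illustrative sketch of the generation idea (hypothetical hierarchy and
# parameters): each leaf node has its own attribute distribution (here,
# per-attribute Gaussians); an internal node's distribution is estimated
# from its descendant leaves by sampling one of them uniformly at random.

leaf_params = {                      # (mean, std) per attribute, per leaf
    "A1": [(0.0, 1.0), (5.0, 1.0)],
    "A2": [(3.0, 1.0), (5.0, 1.0)],
    "B1": [(8.0, 1.0), (0.0, 1.0)],
}
children = {"root": ["A", "B"], "A": ["A1", "A2"], "B": ["B1"]}

def descendant_leaves(node):
    """Leaves reachable from `node` (a leaf is its own descendant)."""
    if node in leaf_params:
        return [node]
    return [l for c in children[node] for l in descendant_leaves(c)]

def sample(node):
    """Draw one instance labelled with `node`: a leaf samples its own
    distribution; an internal node samples a descendant leaf (PD case)."""
    leaf = random.choice(descendant_leaves(node))
    return [random.gauss(m, s) for m, s in leaf_params[leaf]]

# FD dataset: instances only at leaves (balanced: 10 per leaf);
# PD dataset: instances may also stop at internal nodes.
fd = [(leaf, sample(leaf)) for leaf in leaf_params for _ in range(10)]
pd = fd + [("A", sample("A")) for _ in range(10)]
print(len(fd), len(pd))  # 30 40
```

The unbalanced-leaves variant would simply replace the fixed count of 10 with a per-leaf ratio.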
The proposed method for generating artificial datasets is very flexible with respect to the structure of the hierarchy, and it imposes no limits on the number of nodes or the number of attributes. Therefore, artificial datasets can be generated according to the application of interest.
Furthermore, we have generated artificial datasets which are available to the scientific community; the source code is available at the same link, so that readers can generate their own datasets. These datasets are divided by hierarchical structure, that is, Tree or DAG. Each group is further divided by level of difficulty: easy and hard, with the Tree type also including very hard.
Subsequently, some standard and state-of-the-art methods are evaluated to identify which perform well on the different datasets; most of them predict only a single path. The standard methods Top-Down (TD) and Flat, together with state-of-the-art techniques such as HCP (Serrano-Pérez & Sucar, 2019), HCA (Serrano-Pérez & Sucar, 2019), HCC (Serrano-Pérez & Sucar, 2019), nLLCPN (Nakano, Pinto, Pappa, & Cerri, 2017), LCPNB (Nakano et al., 2017), CPE (Ramírez-Corona et al., 2016) and CLUS-HMC (Vens, Struyf, Schietgat, Dzeroski, & Blockeel, 2008), have been evaluated with the proposed artificial datasets. The Friedman test with its post hoc Nemenyi test was then used to compare the performance of the different classifiers. Additionally, we provide precision and recall tables as supplementary material.
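The Friedman/Nemenyi comparison can be sketched with a small self-contained example (Demšar, 2006). The accuracy scores below are hypothetical, not the paper's results, and the sketch assumes no tied scores within a dataset for brevity; the Nemenyi constant q_0.05 = 2.343 is the tabulated value for k = 3 classifiers.

```python
from math import sqrt

# Hypothetical scores: rows are datasets, columns are classifiers.
scores = [
    [0.80, 0.70, 0.85],
    [0.75, 0.65, 0.82],
    [0.90, 0.85, 0.88],
    [0.60, 0.55, 0.70],
    [0.72, 0.68, 0.74],
    [0.81, 0.79, 0.83],
]
n, k = len(scores), len(scores[0])

def row_ranks(row):
    # Rank classifiers within one dataset: rank 1 = best score.
    order = sorted(range(k), key=lambda j: -row[j])
    ranks = [0.0] * k
    for r, j in enumerate(order, start=1):
        ranks[j] = float(r)
    return ranks

# Average rank of each classifier across datasets.
avg_rank = [sum(col) / n for col in zip(*(row_ranks(r) for r in scores))]

# Friedman statistic: chi2_F = 12n/(k(k+1)) * (sum r_j^2 - k(k+1)^2/4)
chi2_f = 12 * n / (k * (k + 1)) * (
    sum(r * r for r in avg_rank) - k * (k + 1) ** 2 / 4)

# Nemenyi critical difference: two classifiers differ significantly
# if their average ranks differ by more than CD.
cd = 2.343 * sqrt(k * (k + 1) / (6 * n))
print([round(r, 2) for r in avg_rank], round(chi2_f, 2), round(cd, 2))
# [1.83, 3.0, 1.17] 10.33 1.35
```

Here the second classifier's average rank (3.0) differs from the third's (1.17) by more than the critical difference, so under this toy data the gap would be significant.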
The contributions of this paper are threefold: first, a method that is able to generate up to four hierarchical classification problems; second, artificial datasets with Tree and DAG structures at different levels of difficulty, which are available to the scientific community; and third, a comparison of several methods for hierarchical classification.
The document is organized as follows. Section 2 summarizes the fundamentals of hierarchical classification. Section 3 reviews related work. Section 4 presents the proposed method to build artificial datasets for hierarchical classification. Section 5 describes the artificial datasets generated with the proposed method. Section 6 describes the methods that were evaluated with the artificial datasets and their results. Finally, Section 7 gives the conclusions and some ideas for future work.
Section snippets
Fundamentals of hierarchical classification
Hierarchical classification is a special type of multilabel classification in which the labels are arranged in a predefined structure; the structure can be a Tree or, in its general form, a Directed Acyclic Graph (DAG). Thus, the Hierarchical Structure (HS) can be denoted with graph notation: HS = (L, E), where L is the set of nodes (labels/classes) and E is the set of edges linking the nodes, and HS is a DAG. Note that a Tree is a DAG in which all nodes have only one parent, except the root node
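The Tree-versus-DAG distinction above can be checked programmatically. A minimal sketch, with a hypothetical hierarchy stored as an edge list:

```python
# A hierarchical structure HS = (L, E) stored as a node list and an edge
# list of (parent, child) pairs. A Tree is the special case of a DAG in
# which every node except the root has exactly one parent.

def is_tree(nodes, edges):
    """Return True if the DAG (nodes, edges) is a Tree."""
    parents = {n: [] for n in nodes}
    for parent, child in edges:
        parents[child].append(parent)
    roots = [n for n in nodes if not parents[n]]
    return len(roots) == 1 and all(
        len(parents[n]) == 1 for n in nodes if n not in roots)

nodes = ["root", "A", "B", "A1", "A2"]
tree_edges = [("root", "A"), ("root", "B"), ("A", "A1"), ("A", "A2")]
dag_edges = tree_edges + [("B", "A1")]   # "A1" now has two parents

print(is_tree(nodes, tree_edges))  # True
print(is_tree(nodes, dag_edges))   # False
```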
Related work
Synthetic data have been used in different ways, from adding some instances to a class to creating completely new datasets. Thus, some methods that create synthetic data are presented below; nevertheless, not all of them generate artificial data for hierarchical classification problems.
Synthetic data have been used to add extra instances to training datasets. Because datasets for hierarchical classification are naturally unbalanced, Feng, Fu, and Zheng (2018) use the sibling policy and then the SMOTE method
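The SMOTE idea referenced above can be sketched in a few lines. This is a toy version of the technique (Bowyer, Chawla, Hall, & Kegelmeyer), not Feng et al.'s pipeline: a synthetic minority instance is created by interpolating between a minority instance and one of its nearest minority-class neighbours.

```python
import random

def smote_one(minority, k=2):
    """Create one synthetic instance from a list of minority-class
    feature vectors by interpolating towards a random one of the
    k nearest minority neighbours (toy SMOTE)."""
    x = random.choice(minority)
    neighbours = sorted(
        (m for m in minority if m is not x),
        key=lambda m: sum((a - b) ** 2 for a, b in zip(x, m)))[:k]
    nn = random.choice(neighbours)
    gap = random.random()  # interpolation factor in [0, 1)
    return [a + gap * (b - a) for a, b in zip(x, nn)]

minority = [[1.0, 1.0], [1.2, 0.9], [0.9, 1.1]]
synthetic = [smote_one(minority) for _ in range(5)]
```

Each synthetic point lies on a segment between two real minority instances, so the oversampled class stays inside its original region of the feature space.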
A method to build artificial data from a hierarchy
As noted in the Introduction, there are some real-world datasets (Ramírez-Corona et al., 2016, Silla and Freitas, 2011, Nakano et al., 2017) which have been used to evaluate new methods for hierarchical classification. However, evaluation on those datasets provides little information about the method itself, such as whether it fails due to the difficulty of the datasets, or whether it is affected by
Artificial Datasets for Hierarchical Classification
In this section, the artificial datasets (AD) for hierarchical classification (HC) are introduced. The AD are divided into two groups: those with a Tree hierarchical structure and those with a DAG.
The reported run times for generating the datasets, for both training and test, were obtained by executing a Python script on a personal computer with an Intel Core i3-2350M, 8 GB RAM and 45 GB HDD.
Evaluation Measures
There are several evaluation measures used in Hierarchical Classification, such as exact-match (Hernandez, Sucar, & Morales, 2013), accuracy (Ramírez-Corona et al., 2016, Secker et al., 2007, Babbar et al., 2013), hamming-accuracy (Ramírez-Corona et al., 2016) and hierarchical f-measure (Silla and Freitas, 2011, Kiritchenko et al., 2006, Naik and Rangwala, 2017, Naik and Rangwala, 2016). However, due to the large number of artificial datasets proposed in this paper, we will focus our results
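As an illustration of one of the measures listed above, the hierarchical f-measure (Kiritchenko et al., 2006) augments predicted and true label sets with all their ancestors before computing precision and recall. A minimal sketch over a hypothetical hierarchy:

```python
# Hypothetical hierarchy as a child -> parents map (root omitted from
# the augmented sets, as is conventional for this measure).
parents = {"A": ["root"], "B": ["root"],
           "A1": ["A"], "A2": ["A"], "B1": ["B"]}

def with_ancestors(labels):
    """Label set augmented with all ancestors except the root."""
    out, stack = set(), list(labels)
    while stack:
        n = stack.pop()
        if n != "root" and n not in out:
            out.add(n)
            stack.extend(parents.get(n, []))
    return out

def h_f1(pred, true):
    """Hierarchical F1: harmonic mean of hierarchical precision/recall."""
    p, t = with_ancestors(pred), with_ancestors(true)
    inter = len(p & t)
    hp, hr = inter / len(p), inter / len(t)
    return 2 * hp * hr / (hp + hr) if hp + hr else 0.0

print(h_f1({"A1"}, {"A2"}))  # 0.5 -- wrong leaf, but shared ancestor "A"
```

Unlike exact-match, this measure gives partial credit when a prediction is wrong at the leaf but correct higher up the path.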
Conclusions and Future Work
In this paper, a method was proposed to generate artificial datasets for the following problems of hierarchical classification: ⟨T, SPL, FD⟩, ⟨T, SPL, PD⟩, ⟨DAG, SPL, FD⟩ and ⟨DAG, SPL, PD⟩. The method makes use of probability distributions, that is, instances are generated from those distributions. The method requires as input the hierarchy and the distributions for each leaf node; it then estimates the distributions for the internal nodes.
Moreover, we generated some artificial datasets with hierarchical
CRediT authorship contribution statement
Jonathan Serrano-Pérez: Conceptualization, Software, Validation, Formal analysis, Writing - original draft, Writing - review & editing. L. Enrique Sucar: Conceptualization, Formal analysis, Resources, Writing - original draft, Writing - review & editing, Supervision.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgement
This work has been partially supported by CONACYT.
References (23)
- et al. (2015). Classifying imbalanced data sets using similarity based hierarchical decomposition. Pattern Recognition.
- et al. (2016). Hierarchical multilabel classification based on path evaluation. International Journal of Approximate Reasoning.
- et al. (2014). A framework to generate synthetic multi-label datasets. Electronic Notes in Theoretical Computer Science.
- et al. Maximum-margin framework for training data synchronization in large-scale hierarchical classification.
- et al. (2008). Bayesian Aggregation for Hierarchical Classification.
- Bowyer, K.W., Chawla, N.V., Hall, L.O., & Kegelmeyer, W.P. (2011). SMOTE: synthetic minority over-sampling technique. ...
- et al. Hierarchical classification: Combining bayes with svm.
- et al. (2006). Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research.
- et al. (2018). A hierarchical multi-label classification method based on neural networks for gene function prediction. Biotechnology & Biotechnological Equipment.
- Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., & Bengio, Y. (2014). ...
- A hybrid global-local approach for hierarchical classification.