Splitting criteria for classification problems with multi-valued attributes and large number of classes
Introduction
Decision Trees and Random Forests are among the most popular methods for classification tasks. Decision Trees, especially small ones, are easy to interpret, while Random Forests usually yield more accurate classifications. One of the key issues in these methods is how to select an attribute to associate with a node of the tree/forest. An important related issue is how to split the samples once the attribute is selected.
There are a number of papers discussing aspects related to attribute selection, such as: how to design criteria to evaluate the quality of different types of attributes; whether binary or multi-way splits should be used; and how to remove bias from splitting criteria. For recent surveys on this topic we refer to [3], [14], [18].
Despite the large body of work, we believe there are still questions to be answered. One of them is how to properly handle nominal attributes that may assume a large number of values. Before explaining the reason behind our statement, we remark that this kind of attribute appears naturally in some applications (e.g., the states of a country or the letters of some alphabet). In addition, such attributes may arise from aggregating attributes that have few distinct values, with the goal of capturing possible correlations between them, as pointed out by Chou [5]. As an example, consider 5 binary attributes (e.g., medical tests) and a binary target variable that has a large probability of being positive if at least 3 out of the 5 tests are positive. By aggregating the 5 binary attributes we obtain a new attribute whose values capture this relation. If we used the 5 attributes separately, we would need 5 levels in the tree to capture the relation between them and the target class, thus incurring a large fragmentation of the set of samples.
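As a rough illustration of this aggregation (the attribute names and the "at least 3 out of 5" threshold below are the hypothetical ones from the example above):

```python
import itertools

def aggregate_binary(rows, cols):
    """Combine several binary columns of each row into one nominal value.

    rows: list of dicts mapping column name -> 0/1.
    Returns a list of tuples; each distinct tuple is one value of the
    aggregated attribute (up to 2**len(cols) values).
    """
    return [tuple(r[c] for c in cols) for r in rows]

# Hypothetical example: 5 medical tests; the class tends to be positive
# when at least 3 tests are positive.
tests = ["t1", "t2", "t3", "t4", "t5"]
rows = [dict(zip(tests, bits)) for bits in itertools.product([0, 1], repeat=5)]
agg = aggregate_binary(rows, tests)

# A single binary split on the aggregated attribute can separate the value
# combinations with >= 3 positives from the rest, instead of requiring a
# 5-level subtree on the individual tests.
left = {v for v in agg if sum(v) >= 3}
print(len(set(agg)), len(left))  # -> 32 16
```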
To properly handle multi-valued nominal attributes, we have to deal with the computational time required to compute good splits. Our contribution addresses this issue.
A brute-force search to compute the best binary split requires Ω(2^n) time, where n is the number of distinct values the attribute may assume. The computational cost can be reduced if an n-ary split is used rather than a binary one. However, this may lead to a severe fragmentation of the sample space, which is not desirable: the number of samples available for each of the children of the split node may be small and, as a consequence, the underlying classification tasks may become significantly more difficult. When the target variable is binary, the Gini Gain, proposed in the influential monograph by Breiman et al. [4], can be computed efficiently. However, when the number of classes k is larger than 2, most, if not all, of the available exact solutions take time exponential in n and k. The Twoing method [4], which is equivalent to the Gini Gain when k = 2, is an interesting case since its running time is O(2^min{n, k}) rather than O(2^n).
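To make the Ω(2^n) cost concrete, the brute-force search can be sketched as follows (a toy sketch, not the paper's implementation; the input encoding is an assumption):

```python
from itertools import combinations

def gini_impurity(counts):
    """Gini impurity of a class-count vector."""
    n = sum(counts)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts)

def best_binary_split(A):
    """Exhaustive search over all binary partitions of the attribute values.

    A: dict mapping attribute value -> list of per-class sample counts.
    Returns (best weighted impurity, frozenset of values sent left).
    Evaluates Theta(2^(n-1)) subsets, hence infeasible for large n.
    """
    values = list(A)
    k = len(next(iter(A.values())))
    total = sum(sum(c) for c in A.values())
    best = (float("inf"), None)
    v0 = values[0]  # fix one value on the left to avoid mirrored partitions
    rest = values[1:]
    for r in range(len(rest) + 1):
        for extra in combinations(rest, r):
            left = frozenset((v0,) + extra)
            lc = [sum(A[v][j] for v in left) for j in range(k)]
            rc = [sum(A[v][j] for v in A if v not in left) for j in range(k)]
            nl, nr = sum(lc), sum(rc)
            w = (nl * gini_impurity(lc) + nr * gini_impurity(rc)) / total
            if w < best[0]:
                best = (w, left)
    return best
```

For instance, with counts `{"a": [3, 0], "b": [0, 3], "c": [3, 0]}` the search puts "a" and "c" on one side, reaching weighted impurity 0.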
When both n and k are large, in the sense that an exhaustive search does not run in a reasonable time, one can rely on heuristics to compute the best binary split. As an example, the GUIDE algorithm [13], the last of a series of algorithms designed by Loh and his collaborators, deals with a nominal variable X as follows: if n ≤ 11, the Gini Index is computed for all possible splits; if k ≤ 11 and n > 20, a new variable X′ with at most k distinct values is created according to a certain rule and an exhaustive search is performed over it; finally, if k > 11 or n ≤ 20, X is binarized and a Linear Discriminant Analysis (LDA) is employed. These rules reflect the difficulty of dealing with multi-valued nominal attributes. In general, the main drawback of using heuristics is the lack of theoretical guarantees about their behavior.
Given this scenario, we propose a framework for designing criteria, with nice theoretical properties, for evaluating the quality of multi-valued nominal attributes. Criteria generated according to this framework run in time polynomial in n and k and have a theoretical guarantee of being close to optimal. The key idea consists of formulating the problem of finding the best binary partition for a given attribute A as the problem of finding a cut with maximum weight in a complete graph whose nodes are associated with the values that A may assume and whose edge weights capture the benefit of putting values in different parts of the partition. The motivation behind the use of the max-cut problem is the existence of efficient algorithms with approximation guarantees, in particular the one proposed by Goemans and Williamson [10], with a 0.878 approximation ratio, and local search algorithms with a 0.5 approximation ratio [1].
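For intuition, the classic 0.5-approximate local search for max-cut (move a vertex across the cut while doing so improves the cut weight) can be sketched as follows. This is an illustrative sketch, not necessarily the exact variant used in the paper's experiments:

```python
def local_search_max_cut(w):
    """Local-search max-cut: flip a vertex to the other side whenever that
    increases the cut weight. A local optimum has weight at least half of
    the total edge weight, hence at least half of the optimal cut.

    w: symmetric n x n matrix of non-negative edge weights (w[i][i] = 0).
    Returns (cut weight, set of vertices on one side).
    """
    n = len(w)
    side = [False] * n  # all vertices start on the same side
    improved = True
    while improved:
        improved = False
        for v in range(n):
            # gain of flipping v = (weight to same side) - (weight to other side)
            gain = sum(w[v][u] * (1 if side[u] == side[v] else -1)
                       for u in range(n) if u != v)
            if gain > 0:
                side[v] = not side[v]
                improved = True
    cut = sum(w[i][j] for i in range(n) for j in range(i + 1, n)
              if side[i] != side[j])
    return cut, {v for v in range(n) if side[v]}
```

Each flip strictly increases the cut weight, so with finite weights the search terminates. On the toy matrix `[[0, 1, 2], [1, 0, 3], [2, 3, 0]]` it finds the optimal cut of weight 5.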
We discuss two criteria derived from this framework: the first can be seen as a natural variation of the Gini Gain, while the second uses the χ2-test to set the edge weights. For the latter, each edge eij, between nodes vi and vj, is thought of as a binary attribute A(i, j) with values vi and vj. After discussing these criteria, we show how to extend them to handle numeric attributes.
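One plausible way to realize the χ2 edge weights (a sketch under the assumption that the raw Pearson statistic is used; the paper may normalize or transform it differently) is to view the samples taking values vi and vj as a 2 × k contingency table:

```python
def chi2_edge_weight(counts_i, counts_j):
    """Pearson chi-square statistic of the 2 x k contingency table whose
    rows are the class-count vectors of values v_i and v_j. The larger the
    statistic, the stronger the evidence that v_i and v_j have different
    class distributions, i.e. that they belong on different sides of the cut.
    """
    table = [counts_i, counts_j]
    k = len(counts_i)
    row = [sum(r) for r in table]
    col = [counts_i[j] + counts_j[j] for j in range(k)]
    total = sum(row)
    stat = 0.0
    for r in range(2):
        for j in range(k):
            expected = row[r] * col[j] / total
            if expected == 0:
                continue  # class absent among these two values
            stat += (table[r][j] - expected) ** 2 / expected
    return stat
```

For example, two values with perfectly opposed class counts, `[10, 0]` and `[0, 10]`, yield the maximal statistic 20 for N = 20 samples, while identical distributions yield 0.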
We also present a number of experiments suggesting that one of our criteria is competitive with the Twoing method, which is, as far as we know, the only well-established criterion with binary splits that can be computed optimally for large n when k > 2. However, in contrast with our methods, Twoing cannot handle datasets that also have a large number of classes. In addition, the experiments provide evidence of the potential of aggregating attributes to improve the accuracy of decision trees.
There has been some investigation on methods to compute the best split efficiently [4], [5], [7], [8]. For the 2-class problem, Breiman et al. [4] proved a theorem which states that an optimal binary partition, for a certain class of splitting criteria, can be determined in time linear in n, the number of distinct values of the attribute, after sorting. The Gini Gain belongs to this class. The other three papers generalize this theorem in different directions and establish necessary conditions that are satisfied by optimal partitions for a certain class of splitting criteria. These conditions, though useful to restrict the set of partitions that need to be considered, do not yield a method that is efficient (polynomial time) for large values of n and k. These papers also present heuristics, without approximation guarantees, to obtain good splits.
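For the 2-class case, the theorem of Breiman et al. gives an O(n log n) procedure: sort the values by the fraction of class-1 samples among them, and only the n - 1 prefix splits of that order need to be evaluated. A sketch (the interface and input encoding are hypothetical; each value is assumed to have at least one sample):

```python
def best_gini_split_two_classes(A):
    """Optimal binary Gini split for a two-class problem.

    A: dict mapping attribute value -> [count_class0, count_class1].
    Returns (weighted impurity, frozenset of values sent left).
    Only n - 1 splits are tested, instead of 2^(n-1) - 1.
    """
    order = sorted(A, key=lambda v: A[v][1] / (A[v][0] + A[v][1]))
    total0 = sum(A[v][0] for v in A)
    total1 = sum(A[v][1] for v in A)
    N = total0 + total1
    best = (float("inf"), None)
    l0 = l1 = 0
    for i, v in enumerate(order[:-1]):  # the n - 1 prefix splits
        l0 += A[v][0]
        l1 += A[v][1]
        r0, r1 = total0 - l0, total1 - l1
        nl, nr = l0 + l1, r0 + r1
        gini_l = 1 - (l0 / nl) ** 2 - (l1 / nl) ** 2
        gini_r = 1 - (r0 / nr) ** 2 - (r1 / nr) ** 2
        w = (nl * gini_l + nr * gini_r) / N
        if w < best[0]:
            best = (w, frozenset(order[: i + 1]))
    return best
```

On the counts `{"a": [3, 0], "b": [0, 3], "c": [3, 0]}` this recovers the same optimal partition {a, c} vs. {b} that the exponential search would find.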
Other proposals to speed up the attribute selection phase include [16], [21]. The first presents a simple heuristic to reduce the number of binary splits considered when choosing the best nominal variable among the m available ones. The second extends the method to another class of impurity measures.
In order to properly handle nominal attributes with a large number of values, apart from efficiently computing good splits, it is important to prevent bias in the attribute selection. Indeed, it is widely known that many splitting criteria have bias toward attributes with a large number of values. There are some proposals available to cope with this issue [9], [11], [22]. This topic, though relevant, is not the focus of our paper.
Notation and background
We adopt the following notation. Let S be a set of N samples and C = {c1, …, ck} be the domain of the class label. In addition, for an attribute A, we use A(s) to denote the value taken by attribute A on sample s and V(A) = {v1, …, vn} to denote the set of values taken by A; Aij refers to the number of samples from class cj for which A takes value vi; Ni is the number of samples with value vi for attribute A; and Sj is the number of samples from class cj.
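These quantities can be tabulated in a single pass over the samples. A minimal sketch, assuming a hypothetical encoding of each sample as a (value, class) pair:

```python
from collections import Counter, defaultdict

def attribute_counts(samples):
    """Tabulate A_ij (samples of class c_j with value v_i), N_i (samples
    with value v_i) and S_j (samples of class c_j) for one attribute.

    samples: iterable of (attribute value, class label) pairs.
    """
    A = defaultdict(Counter)   # A[v][c] corresponds to A_ij
    N = Counter()              # N[v] corresponds to N_i
    S = Counter()              # S[c] corresponds to S_j
    for value, cls in samples:
        A[value][cls] += 1
        N[value] += 1
        S[cls] += 1
    return A, N, S
```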
A framework for generating splitting criteria for multi-valued attributes
In this section we explain our approach to building binary splitting criteria for multi-valued nominal attributes.
Let A be a nominal attribute that takes values in the domain V(A) = {v1, …, vn}. Our framework to produce a splitting criterion I consists of three steps:
1. Create a complete graph with n vertices.
2. Assign a non-negative weight wij to the edge that connects vi to vj. This value shall reflect the benefit of putting vi and vj in different parts of the partition. Different definitions of wij yield different splitting criteria.
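The graph-building steps above can be sketched generically, with the edge-weight definition left as a parameter (an illustrative sketch; the function names are hypothetical):

```python
def build_weight_matrix(values, counts, edge_weight):
    """Build the complete graph on the n attribute values as a symmetric
    weight matrix. `edge_weight` is any non-negative function of the
    class-count vectors of two values (e.g. a Gini-based or chi-square
    based weight).
    """
    n = len(values)
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            w[i][j] = w[j][i] = edge_weight(counts[values[i]], counts[values[j]])
    return w
```

The resulting matrix is exactly the input expected by a max-cut routine, so any max-cut algorithm with an approximation guarantee can be plugged in as the third step.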
Experimental evaluation
In this section we describe our experimental study. First, we describe the chosen datasets. Next, we discuss the max-cut algorithms employed, and then we present our results.
All experiments described in the following sections were executed on a machine with the following settings: Intel(R) Core(TM) i7-4790 CPU @ 3.60 GHz with 32GB of RAM. The code was developed using Python 3.6.1 with the libraries numpy, scipy, scikit-learn and cvxpy. The project can be accessed in
Final remarks
In this paper we proposed a framework for designing splitting criteria for handling multi-valued nominal attributes. Criteria derived from our framework can be implemented to run in polynomial time in n and k, with theoretical guarantee of producing a split that is close to the optimal one.
Experiments over 11 datasets suggest that the GLχ2 criterion, obtained from our framework, is competitive with the well-established Twoing criterion in terms of both accuracy and speed for datasets with a
Acknowledgments
The first author is partially supported by CNPq, grant 477946/2013-5. The second author is partially supported by CNPq.
References (22)
- et al., Local max-cut in smoothed polynomial time, ACM STOC, 2017.
- Austin-Animal-Center, Shelter animal outcomes dataset, ...
- et al., Automatic design of decision-tree induction algorithms, SpringerBriefs in Computer Science, 2015.
- L. Breiman et al., Classification and Regression Trees, 1984.
- P.A. Chou, Optimal partitioning for classification and regression trees, IEEE Trans. Pattern Anal. Mach. Intell., 1991.
- CMU, CMU pronouncing dictionary, ...
- et al., Partitioning nominal attributes in decision trees, Data Min. Knowl. Discov., 1999.
- et al., Minimum impurity partitions, Ann. Stat., 1992.
- et al., Bias correction in classification tree construction.
- M.X. Goemans, D.P. Williamson, Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming, JACM, 1995.
- Unbiased recursive partitioning, J. Comput. Graph. Stat.