Splitting criteria for classification problems with multi-valued attributes and large number of classes
Introduction
Decision Trees and Random Forests are among the most popular methods for classification tasks. Decision Trees, especially small ones, are easy to interpret, while Random Forests usually yield more accurate classifications. One of the key issues in these methods is how to select an attribute to associate with a node of the tree/forest. An important related issue is how to split the samples once the attribute is selected.
There are a number of papers discussing aspects related to attribute selection, such as: how to design criteria to evaluate the quality of different types of attributes; whether binary or multi-way splits should be used; and how to remove bias from splitting criteria. For recent surveys on this topic we refer to [3], [14], [18].
Despite the large body of work, we believe there are still questions to be answered. One of them is how to properly handle nominal attributes that may assume a large number of values. Before explaining the reason behind our statement, we remark that this kind of attribute appears naturally in some applications (e.g., the states of a country or the letters of some alphabet). In addition, such attributes may arise from aggregating attributes that have few distinct values, with the goal of capturing possible correlations between them, as pointed out by Chou [5]. As an example, consider 5 binary attributes (e.g., medical tests) and a binary target variable that has a large probability of being positive if at least 3 out of the 5 tests are positive. By aggregating the 5 binary attributes we obtain a new attribute whose values capture this relation. If we used the 5 attributes separately, we would need 5 levels in the tree to capture the relation between them and the target class, thus incurring a large fragmentation of the set of samples.
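As a rough illustration of this aggregation (the attribute names and the "at least 3 out of 5" threshold below are the hypothetical ones from the example above):

```python
import itertools

def aggregate_binary(rows, cols):
    """Combine several binary columns of each row into one nominal value.

    rows: list of dicts mapping column name -> 0/1.
    Returns a list of tuples; each distinct tuple is one value of the
    aggregated attribute (up to 2**len(cols) values).
    """
    return [tuple(r[c] for c in cols) for r in rows]

# Hypothetical example: 5 medical tests; the class tends to be positive
# when at least 3 tests are positive.
tests = ["t1", "t2", "t3", "t4", "t5"]
rows = [dict(zip(tests, bits)) for bits in itertools.product([0, 1], repeat=5)]
agg = aggregate_binary(rows, tests)

# A single binary split on the aggregated attribute can separate the value
# combinations with >= 3 positives from the rest, instead of requiring a
# 5-level subtree on the individual tests.
left = {v for v in agg if sum(v) >= 3}
print(len(set(agg)), len(left))  # -> 32 16
```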
To properly handle multi-valued nominal attributes, we have to deal with the computational time required to compute good splits. Our contribution addresses this issue.
A brute-force search to compute the best binary split requires Ω(2^n) time, where n is the number of distinct values the attribute may assume. The computational cost can be reduced if an n-ary split is used rather than a binary one. However, this may lead to a severe fragmentation of the sample space, which is not desirable: the number of samples available for each of the children of the split node may be small and, as a consequence, the underlying classification tasks may become significantly more difficult. When the target variable is binary, the Gini Gain, proposed in the influential monograph by Breiman et al. [4], can be computed efficiently. However, when the number of classes k is larger than 2, most, if not all, of the available exact solutions take time exponential in n and k. The Twoing method [4], which is equivalent to the Gini Gain when k = 2, is an interesting case since its running time is O(2^min{n, k}) rather than O(2^n).
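To make the Ω(2^n) cost concrete, the brute-force search can be sketched as follows (a toy sketch, not the paper's implementation; the input encoding is an assumption):

```python
from itertools import combinations

def gini_impurity(counts):
    """Gini impurity of a class-count vector."""
    n = sum(counts)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts)

def best_binary_split(A):
    """Exhaustive search over all binary partitions of the attribute values.

    A: dict mapping attribute value -> list of per-class sample counts.
    Returns (best weighted impurity, frozenset of values sent left).
    Evaluates Theta(2^(n-1)) subsets, hence infeasible for large n.
    """
    values = list(A)
    k = len(next(iter(A.values())))
    total = sum(sum(c) for c in A.values())
    best = (float("inf"), None)
    v0 = values[0]  # fix one value on the left to avoid mirrored partitions
    rest = values[1:]
    for r in range(len(rest) + 1):
        for extra in combinations(rest, r):
            left = frozenset((v0,) + extra)
            lc = [sum(A[v][j] for v in left) for j in range(k)]
            rc = [sum(A[v][j] for v in A if v not in left) for j in range(k)]
            nl, nr = sum(lc), sum(rc)
            w = (nl * gini_impurity(lc) + nr * gini_impurity(rc)) / total
            if w < best[0]:
                best = (w, left)
    return best
```

For instance, with counts `{"a": [3, 0], "b": [0, 3], "c": [3, 0]}` the search puts "a" and "c" on one side, reaching weighted impurity 0.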
When both n and k are large, in the sense that an exhaustive search does not run in a reasonable time, one can rely on heuristics to compute the best binary split. As an example, the GUIDE algorithm [13], the last of a series of algorithms designed by Loh and his collaborators, deals with a nominal variable X as follows: if n ≤ 11, the Gini Index is computed for all possible splits; if k ≤ 11 and n > 20, a new variable X′ with at most k distinct values is created according to a certain rule and an exhaustive search is performed over it; finally, if k > 11 or n ≤ 20, X is binarized and a Linear Discriminant Analysis (LDA) is employed. These rules reflect the difficulty of dealing with multi-valued nominal attributes. In general, the main drawback of using heuristics is the lack of theoretical guarantees about their behavior.
Given this scenario, we propose a framework for designing criteria, with nice theoretical properties, for evaluating the quality of multi-valued nominal attributes. Criteria generated according to this framework run in time polynomial in n and k and have a theoretical guarantee of being close to optimal. The key idea consists of formulating the problem of finding the best binary partition for a given attribute A as the problem of finding a cut with maximum weight in a complete graph whose nodes are associated with the values that A may assume and whose edge weights capture the benefit of putting values in different parts of the partition. The motivation behind the use of the max-cut problem is the existence of efficient algorithms with approximation guarantees, in particular the one proposed by Goemans and Williamson [10], with a 0.878 approximation ratio, and local search algorithms with a 0.5 approximation ratio [1].
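For intuition, the classic 0.5-approximate local search for max-cut (move a vertex across the cut while doing so improves the cut weight) can be sketched as follows. This is an illustrative sketch, not necessarily the exact variant used in the paper's experiments:

```python
def local_search_max_cut(w):
    """Local-search max-cut: flip a vertex to the other side whenever that
    increases the cut weight. A local optimum has weight at least half of
    the total edge weight, hence at least half of the optimal cut.

    w: symmetric n x n matrix of non-negative edge weights (w[i][i] = 0).
    Returns (cut weight, set of vertices on one side).
    """
    n = len(w)
    side = [False] * n  # all vertices start on the same side
    improved = True
    while improved:
        improved = False
        for v in range(n):
            # gain of flipping v = (weight to same side) - (weight to other side)
            gain = sum(w[v][u] * (1 if side[u] == side[v] else -1)
                       for u in range(n) if u != v)
            if gain > 0:
                side[v] = not side[v]
                improved = True
    cut = sum(w[i][j] for i in range(n) for j in range(i + 1, n)
              if side[i] != side[j])
    return cut, {v for v in range(n) if side[v]}
```

Each flip strictly increases the cut weight, so with finite weights the search terminates. On the toy matrix `[[0, 1, 2], [1, 0, 3], [2, 3, 0]]` it finds the optimal cut of weight 5.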
We discuss two criteria derived from this framework: the first can be seen as a natural variation of the Gini Gain, while the second uses the χ2-test to set the edge weights. For the latter, each edge eij, between nodes vi and vj, is thought of as a binary attribute A(i, j) with values vi and vj. After discussing these criteria, we show how to extend them to handle numeric attributes.
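One plausible way to realize the χ2 edge weights (a sketch under the assumption that the raw Pearson statistic is used; the paper may normalize or transform it differently) is to view the samples taking values vi and vj as a 2 × k contingency table:

```python
def chi2_edge_weight(counts_i, counts_j):
    """Pearson chi-square statistic of the 2 x k contingency table whose
    rows are the class-count vectors of values v_i and v_j. The larger the
    statistic, the stronger the evidence that v_i and v_j have different
    class distributions, i.e. that they belong on different sides of the cut.
    """
    table = [counts_i, counts_j]
    k = len(counts_i)
    row = [sum(r) for r in table]
    col = [counts_i[j] + counts_j[j] for j in range(k)]
    total = sum(row)
    stat = 0.0
    for r in range(2):
        for j in range(k):
            expected = row[r] * col[j] / total
            if expected == 0:
                continue  # class absent among these two values
            stat += (table[r][j] - expected) ** 2 / expected
    return stat
```

For example, two values with perfectly opposed class counts, `[10, 0]` and `[0, 10]`, yield the maximal statistic 20 for N = 20 samples, while identical distributions yield 0.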
We also present a number of experiments suggesting that one of our criteria is competitive with the Twoing method, which is, as far as we know, the only well-established criterion with binary splits that can be computed optimally for large n when k > 2. However, in contrast with our methods, Twoing cannot handle datasets that also have a large number of classes. In addition, the experiments provide evidence of the potential of aggregating attributes to improve the accuracy of decision trees.
There has been some investigation on methods to compute the best split efficiently [4], [5], [7], [8]. For the 2-class problem, Breiman et al. [4] proved a theorem which states that an optimal binary partition, for a certain class of splitting criteria, can be determined in time linear in n, the number of distinct values of the attribute, after sorting. The Gini Gain belongs to this class. The other three papers generalize this theorem in different directions and establish necessary conditions that are satisfied by optimal partitions for a certain class of splitting criteria. These conditions, though useful to restrict the set of partitions that need to be considered, do not yield a method that is efficient (polynomial time) for large values of n and k. These papers also present heuristics, without approximation guarantees, to obtain good splits.
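For the 2-class case, the theorem of Breiman et al. gives an O(n log n) procedure: sort the values by the fraction of class-1 samples among them, and only the n - 1 prefix splits of that order need to be evaluated. A sketch (the interface and input encoding are hypothetical; each value is assumed to have at least one sample):

```python
def best_gini_split_two_classes(A):
    """Optimal binary Gini split for a two-class problem.

    A: dict mapping attribute value -> [count_class0, count_class1].
    Returns (weighted impurity, frozenset of values sent left).
    Only n - 1 splits are tested, instead of 2^(n-1) - 1.
    """
    order = sorted(A, key=lambda v: A[v][1] / (A[v][0] + A[v][1]))
    total0 = sum(A[v][0] for v in A)
    total1 = sum(A[v][1] for v in A)
    N = total0 + total1
    best = (float("inf"), None)
    l0 = l1 = 0
    for i, v in enumerate(order[:-1]):  # the n - 1 prefix splits
        l0 += A[v][0]
        l1 += A[v][1]
        r0, r1 = total0 - l0, total1 - l1
        nl, nr = l0 + l1, r0 + r1
        gini_l = 1 - (l0 / nl) ** 2 - (l1 / nl) ** 2
        gini_r = 1 - (r0 / nr) ** 2 - (r1 / nr) ** 2
        w = (nl * gini_l + nr * gini_r) / N
        if w < best[0]:
            best = (w, frozenset(order[: i + 1]))
    return best
```

On the counts `{"a": [3, 0], "b": [0, 3], "c": [3, 0]}` this recovers the same optimal partition {a, c} vs. {b} that the exponential search would find.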
Other proposals to speed up the attribute selection phase include [16], [21]. The first presents a simple heuristic to reduce the number of binary splits considered when choosing the best nominal variable among the m available ones. The second extends the method to another class of impurity measures.
In order to properly handle nominal attributes with a large number of values, apart from efficiently computing good splits, it is important to prevent bias in the attribute selection. Indeed, it is widely known that many splitting criteria have bias toward attributes with a large number of values. There are some proposals available to cope with this issue [9], [11], [22]. This topic, though relevant, is not the focus of our paper.
Notation and background
We adopt the following notation. Let S be a set of N samples and C = {c1, …, ck} be the domain of the class label. In addition, for an attribute A, we use A(s) to denote the value taken by attribute A on sample s and V(A) = {v1, …, vn} to denote the set of values taken by A; Aij refers to the number of samples from class cj for which A takes value vi; Ni is the number of samples with value vi for attribute A; and Sj is the number of samples from class cj.
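These quantities can be tabulated in a single pass over the samples. A minimal sketch, assuming a hypothetical encoding of each sample as a (value, class) pair:

```python
from collections import Counter, defaultdict

def attribute_counts(samples):
    """Tabulate A_ij (samples of class c_j with value v_i), N_i (samples
    with value v_i) and S_j (samples of class c_j) for one attribute.

    samples: iterable of (attribute value, class label) pairs.
    """
    A = defaultdict(Counter)   # A[v][c] corresponds to A_ij
    N = Counter()              # N[v] corresponds to N_i
    S = Counter()              # S[c] corresponds to S_j
    for value, cls in samples:
        A[value][cls] += 1
        N[value] += 1
        S[cls] += 1
    return A, N, S
```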
A framework for generating splitting criteria for multi-valued attributes
In this section we explain our approach to building binary splitting criteria for multi-valued nominal attributes.
Let A be a nominal attribute that takes values in the domain V(A) = {v1, …, vn}. Our framework to produce a splitting criterion I consists of three steps:
1. Create a complete graph with n vertices.
2. Assign a non-negative weight wij to the edge that connects vi to vj. This value shall reflect the benefit of putting vi and vj in different parts of the partition. Different definitions of wij yield different splitting criteria.
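The graph-building steps above can be sketched generically, with the edge-weight definition left as a parameter (an illustrative sketch; the function names are hypothetical):

```python
def build_weight_matrix(values, counts, edge_weight):
    """Build the complete graph on the n attribute values as a symmetric
    weight matrix. `edge_weight` is any non-negative function of the
    class-count vectors of two values (e.g. a Gini-based or chi-square
    based weight).
    """
    n = len(values)
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            w[i][j] = w[j][i] = edge_weight(counts[values[i]], counts[values[j]])
    return w
```

The resulting matrix is exactly the input expected by a max-cut routine, so any max-cut algorithm with an approximation guarantee can be plugged in as the third step.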
Experimental evaluation
In this section we describe our experimental study. First, we describe the chosen datasets. Next, we discuss the max-cut algorithms employed, and then we present our results.
All experiments described in the following sections were executed on a machine with the following settings: Intel(R) Core(TM) i7-4790 CPU @ 3.60 GHz with 32GB of RAM. The code was developed using Python 3.6.1 with the libraries numpy, scipy, scikit-learn and cvxpy. The project can be accessed in
Final remarks
In this paper we proposed a framework for designing splitting criteria for handling multi-valued nominal attributes. Criteria derived from our framework can be implemented to run in polynomial time in n and k, with theoretical guarantee of producing a split that is close to the optimal one.
Experiments over 11 datasets suggest that the GLχ2 criterion, obtained from our framework, is competitive with the well-established Twoing criterion in terms of both accuracy and speed for datasets with a
Acknowledgments
The first author is partially supported by CNPq, grant 477946/2013-5. The second author is partially supported by CNPq.
References (22)
- et al., Local max-cut in smoothed polynomial time, ACM STOC, 2017.
- Austin-Animal-Center, Shelter animal outcomes dataset, ...
- et al., Automatic design of decision-tree induction algorithms, SpringerBriefs in Computer Science, 2015.
- L. Breiman et al., Classification and Regression Trees, 1984.
- P.A. Chou, Optimal partitioning for classification and regression trees, IEEE Trans. Pattern Anal. Mach. Intell., 1991.
- CMU, CMU pronouncing dictionary, ...
- et al., Partitioning nominal attributes in decision trees, Data Min. Knowl. Discov., 1999.
- et al., Minimum impurity partitions, Ann. Stat., 1992.
- et al., Bias correction in classification tree construction.
- M.X. Goemans, D.P. Williamson, Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming, JACM, 1995.
- Unbiased recursive partitioning, J. Comput. Graph. Stat.