Abstract
Decision trees rank among the most popular and efficient classification methods. They are used to represent rules for recursively partitioning the data space into regions from which reliable predictions regarding classes can be made. These regions are usually delimited by axis-parallel or oblique hyperplanes. Axis-parallel hyperplanes are intuitively appealing and have been widely studied. However, there is still room for exploring different approaches. In this paper, a splitting rule that constructs axis-parallel hyperplanes by computing the Nash equilibrium of a game played at the node level is used to induce a Nash Equilibrium Decision Tree for binary classification. Numerical experiments are used to illustrate the behavior of the proposed method.
1 Introduction
The field of machine learning is continuously expanding, with new algorithms providing means of analyzing large amounts of complex data. The number of fields in which data analysis tools have become relevant, by revealing useful information and aiding decision-making, has also been increasing, largely due to the algorithms' capabilities and the available computational power. In particular, classification methods are used to solve problems in fields ranging from medicine [2], physics [3], chemistry [4], agriculture [5], and economics [6] to the social sciences [7]. Within real-world applications, variations of classical, textbook approaches [8, 9], sometimes tailored to the specificity of the field, are most frequently used. Decision trees (DT [10]) are among the most popular of these approaches. One of the main reasons for their popularity is their ease of use and inherent explainability: they offer rules that may further help explain the data. However, to improve their accuracy, they are often used as base classifiers in ensemble approaches such as boosting or bagging, thus losing their explainability appeal.
Binary classification, the problem of dividing data into two classes, is fundamental to machine learning. Binary classification problems are straightforward to understand and analyze. By focusing on just two classes, the complexity of the problem is reduced, making it easier to interpret the results and draw meaningful conclusions. It also serves as the foundation for more complex classification tasks like multi-class classification. Many algorithms and techniques used in binary classification can be extended to handle multiple classes [11]. Many real-world problems are formulated as binary classification in different fields such as finance [12], healthcare [13], biochemistry [14], cyber-security [15], and others [16]. Binary classification models also benefit from a clear evaluation framework, where performance metrics such as accuracy, precision, recall, F1-score, and area under the receiver operating characteristic curve (AUC-ROC) can be easily calculated and understood, a crucial feature for real-world applications.
A new approach to solving the binary classification problem, based on decision trees and game theory, is proposed in this paper. A decision tree recursively splits data into regions. In the tree representation, each node corresponds to a region in the data space, and a splitting mechanism is used to divide the data into two sub-regions corresponding to the two sub-nodes. The splitting mechanism defines the regions and the flavor of the tree. This paper proposes a novel splitting mechanism for node data based on the Nash equilibrium (NE) concept [17]. A game between the two classes is designed, with the purpose of identifying an equilibrium hyperplane to separate the data. The NE is thus used as an alternative to the optimization solutions generally used in decision tree induction. The main advantage of using the NE is stability: players who choose the NE have no unilateral incentive for deviation. A classification based on the NE can be as good as or even better than one based on optimization, and it can also provide a new kind of trade-off that may have beneficial practical implications.
The paper is organized as follows: Section 2 presents general approaches to the classification problem with a focus on decision trees and existing game-theoretic approaches. Section 3 describes the proposed approach, the Nash Equilibria Decision Tree (NE-DT). Numerical experiments presented in Section 4 illustrate the behavior of NE-DT on a set of synthetic and real-world benchmarks for classification. The paper ends with conclusions and further work (Section 5).
2 Related work
The classification problem consists of finding rules for assigning labels, or classes, to data based on existing information provided in a training data set. If only two classes are considered, the problem is called binary and is one of the most important instances of classification problems, as it offers a multitude of theoretical opportunities and applications. Multiple solutions exist for extending a binary classification algorithm to a multi-class one; for example, a straightforward solution consists in using the one-versus-all approach [18].
Formally, we are given a set of data instances \(\mathcal {X}\subset \mathbb {R}^{n\times d}\) with corresponding labels \(y\in \mathcal {Y}\subset \{0,1\}^n\). We want to find a rule \(\mathcal {R}\) that assigns to each instance \(x \in \mathcal {X}\) a label \(\hat{y} \in \{0,1\}\) as close as possible to its real one. We then assume that the rule \(\mathcal {R}\) can be used to assign labels to other instances coming from the same source as those in \(\mathcal {X}\).
The solution of the binary classification problem can have various representations [19]. In many models, the rule \(\mathcal {R}\) provides the means to estimate the probability that an instance belongs to one of the classes. Decision trees [20] generate the classification rule by recursively splitting the available data using hyper-planes until regions containing instances having (almost) the same label are delineated. The tree representation is considered intuitive and useful for illustrating the classification rule. The interpretability aspect plays an important role in using decision trees in real-world applications [21,22,23,24]. Because of that, a tool for making DTs usable by non-specialists was proposed in [25].
Decision trees are usually built recursively by choosing at each node level the best attribute(s) based on which the node data is split. Hence, most decision trees use an optimization process during their induction. Whether we are talking about splitting data within a DT node or in general, a good solution requires a trade-off between precision and predictability, i.e., between performance at the training and at the test level. Tackling this problem from the optimization point of view has been approached in many ways, for example by constructing and combining different performance indicators [26], by using multi-objective approaches [27, 28], or by including fuzzy concepts [29].
When considering the design and induction of decision trees, there are several criteria that can be used to classify approaches. One of them is the manner in which the tree is induced, i.e., top-down [8, 30] or bottom-up [31, 32]. In the classic and probably most popular top-down approach, which is the one we use in this paper, tree induction starts with the entire data set in the root node. Bottom-up approaches usually employ a clustering or community structure method to create leaves and build the tree starting from them. On the other hand, optimal decision trees construct the entire tree as an optimization problem [33].
One of the most important features that characterizes a particular flavor of decision tree is the way it splits data at node level. Most decision trees use a hyperplane to split data. The first ones used axis-parallel hyperplanes, involving one attribute at each node level [20]. This attribute is chosen to maximize/minimize some purity indicator such as entropy or the gini index [8]. A direct generalization was the introduction of oblique decision trees [34, 35], whose node-splitting hyperplanes use several attributes. In this case the induction is more computationally expensive, but oblique trees are in general more accurate than axis-parallel ones. Other splitting rules involve polytopes [36] or nonlinear splits [37]. Recently, soft approaches have borrowed features from neural networks, such as back-propagation and the use of the sigmoid function to split node data [38, 39]. Another approach is to combine multiple evaluation measures [40].
Regardless of the type of split, a criterion is required to choose among possible parameters. Usually, this criterion involves evaluating the purity of the split sets with respect to the labels; whether the approach is based on statistics or information theory, it essentially uses the proportion of instances with different labels in the split data [41]. A variation of the same purity criterion is also used to decide whether the data in a node is pure enough to require no further splitting, in which case the node becomes a leaf. A node may also become a leaf if a maximum allowed tree depth is reached. The tree size is also subject to decision making from the design point of view, as a trade-off between accuracy, computational complexity, and overfitting is desired [8, 42]. Last but not least, an important aspect is the method used to interpret the data in the leaves, i.e., the separate regions, in order to make predictions regarding the labels of new instances that belong to that region. Most often a probability based on the proportion of instances with each label in the leaf is computed.
Many other methods for tree construction have been proposed in the literature. A minimum-query-set approach that uses linear programming and a genetic algorithm is proposed in [43]. A blockchain-based ID3 decision tree classification framework for distributed networks is proposed in [44]. Functional trees use a linear combination of features in the nodes [45]. DTs are highly adaptable to different types of data. A nested-trees approach for longitudinal data is proposed in [46]. A decision tree for sequences is designed in [47]. In [48] a quantum decision tree is proposed. A decision tree based on granular computing theory is presented in [49]. A decision tree for ordinal classification problems is proposed in [50].
Optimal decision trees [33] convert the induction of the tree into a mixed-integer optimization problem. This method has been widely extended and used in various applications. The optimal randomized classification tree in [51] is based on continuous optimization using a random decision-making process at each node. A column-generation-based matheuristic for learning classification trees is proposed in [52]. Dynamic programming is used for MurTree in [53].
Another category of widely used decision trees are based on fuzzy logic [29]. In [54] the authors extend a Hoeffding Decision Tree with fuzziness for data stream classification. Fuzzy decision trees are used in a three-way classification approach in [55]. A decision tree based on a fuzzy analytic hierarchy process is used for classification of electronic music in [56].
Optimization methods [57] play an important role in machine learning, and in particular in classification [58, 59]. As mentioned above, the induction of DTs often relies on an underlying optimization method. Depending on the approach, exact methods or heuristics can be used. There are many examples of heuristic search in machine learning: training a multi-layer perceptron with the artificial algae algorithm [60], chaos theory in metaheuristics [61], or chaotic golden ratio guided local search for big data optimization [62]. The advantage of using heuristics for optimization in machine learning is that they usually do not require particular mathematical properties of the objective function, making them adaptable to various difficult practical applications. However, they are sometimes subject to theoretical scrutiny, as convergence is most often established only by statistical means. One of the advantages of the method proposed in this paper is that it does not depend on, or require, an optimization heuristic.
There is no doubt that the literature describing various types of decision trees for classification is rich in methods and applications, seemingly offering little room for innovation and improvement. However, the continuous need for new flavors of decision trees, not only for particular applications but also for classification in general, indicates that there is still a gap to be filled. Although most avenues have been thoroughly explored, considering the complexity of the problem, exploring different solution concepts may still provide the decision maker with new perspectives on the solution space. Game theory models strategic or conflicting situations, providing various solution concepts with a strong theoretical background. In this paper we aim to explore the use of the Nash equilibrium as a direct solution for binary classification.
Nash Equilibrium for classification
A normal form game is defined by a set of players, a set of actions available to each player, and a set of payoff functions. The payoff of each player depends on the actions of all players. The Nash equilibrium is defined as a situation of the game in which no player has an incentive for unilateral deviation [63]. It represents a stable trade-off in conflicting situations, which makes it an ideal candidate for problems that combine multiple, contradictory objectives, such as the classification problem. Despite this, and despite the large amount of work relating game theory and machine learning [64], there is very little research on the use of the NE directly as a solution for classification.
An attempt to model the problem as a game, based on SVMs, is presented in [65]; it uses the generalized Nash equilibrium as a solution for binary and multi-class classification problems, with players representing different labels attempting to minimize their distance to the separating hyperplane. In [66], a tree configuration game is used in stream mining systems. Another application uses a game approach for image classification with a Markov random field model [67]. In [68] a normal form game among data instances is designed and used to estimate the parameters of probabilistic binary classification models. Apart from these approaches, in most cases the NE or other appropriate game-theoretic solution concepts are used to process results and select solutions after a classification method has been applied.
Most of these applications are found within the adversarial classification framework [69], in which usually two players are involved. There are two motivations for this: (i) as mentioned before, the properties of the NE are particularly suitable for addressing conflicting situations; and (ii) there are many methods available to compute or approximate the NE of a two-player game. The first motivation applies to any problem involving decision making among agents with different objectives. The second, the computational aspect of determining equilibria, is the one that usually limits the practical application of this concept on a wider scale.
Adversarial models vary largely in applicability. In intrusion detection, a game is designed between the intruder and the intrusion detection mechanism represented by a binary classification method [70]. Another application, in fault diagnosis [71], uses the NE during the training of a convolutional neural network. In [72] the authors study the behavior of defenders and attackers at equilibrium; they show that the Fast Gradient Method on the attacker's side and Randomized Smoothing on the defender's side form an equilibrium that can be approximated using a finite number of samples from the distribution. A game for VoIP detection is designed in [73]. The adversarial setting also allows theoretical studies of equilibrium [74]. However, this paper is not concerned with adversarial models, but with the use of the Nash equilibrium directly as a solution concept within the classification model. For the induction of decision trees, to the best of our knowledge, there have been no attempts to use the NE directly to split node data.
3 Nash equilibria - decision tree - NE-DT
The Nash equilibrium concept is used to compute the parameters of the splitting hyperplane at node level. A splitting game is designed between the two classes. The goal is to find hyperplane parameters that shift instances with one label as far as possible from instances with the other label, making them easier to separate.
3.1 Splitting game
A two-player game \(\Gamma {(X,y \mid j)}\) is designed to split data X, having corresponding labels y, based on attribute j:
- the players are represented by the two sub-nodes, the left (L) and the right (R), respectively;
- the strategy of each player is to propose a hyperplane parameter, \(\beta ^L\) and \(\beta ^R\), respectively, with \(\beta ^L, \beta ^R \in \mathbb {R}^2\);
- the payoff to be minimized by each player is designed to shift instances with different labels as far as possible from each other while keeping the parameter values as small as possible:
$$\begin{aligned} \begin{array}{lll} \displaystyle u^L(\beta ^L, \beta ^R)& = & \sum _{{x_{i}\in X,\, y_i=0}} (x_{ij}\beta _{j,1} + \beta _{j,0}) + \Vert \beta ^L\Vert ^2, \\ \displaystyle u^R (\beta ^L, \beta ^R)& = & - \sum _{{x_{i}\in X,\, y_i=1}} (x_{ij}\beta _{j,1} + \beta _{j,0}) + \Vert \beta ^R\Vert ^2. \end{array} \end{aligned}$$ (1)
The value of \(\beta _j=(\beta _{j,0}, \beta _{j,1})\in \mathbb {R}^2\) is computed as
$$\beta _j=\frac{1}{2}(\beta ^L+\beta ^R),$$
and \(\Vert \beta \Vert ^2=\Vert (\beta _0,\beta _1)\Vert ^2=\beta _0^2+\beta _1^2\).
The payoffs of game \(\Gamma \) are designed so that the left player L proposes a strategy \(\beta ^L\) that minimizes the sum of the values \(x_{ij}\beta _{j,1}+\beta _{j,0}\) over the instances with label 0, while the right player R proposes a strategy \(\beta ^R\) that maximizes the corresponding sum over the instances with label 1. In fact, each player proposes a parameter that would push the data with one label as far away as possible from the data with the other label. A compromise is obtained by taking the average of their strategies as the hyperplane parameter.
Nash equilibrium
The Nash equilibrium [17, 63] is one of the most popular solution concepts in game theory, as it offers a type of stability that is appealing in practical, conflicting situations. A NE is a situation of the game, i.e., a choice of strategies for each player, such that none of them has a unilateral incentive for deviation. Thus, in a NE, none of the players can gain anything by changing their strategy while the others maintain theirs. The Nash equilibrium of game \(\Gamma \) is represented by strategies \(\beta ^L\) and \(\beta ^R\) such that there is no change in either of them that would lead to a better payoff while the other player maintains its choice.
The NE for game \(\Gamma (X,y\mid j)\) can be computed by solving the following optimization problems:
$$\beta ^{L^\star }=\arg \min _{\beta ^L\in \mathbb {R}^2} u^L(\beta ^L,\beta ^{R^\star })$$
and
$$\beta ^{R^\star }=\arg \min _{\beta ^R\in \mathbb {R}^2} u^R(\beta ^{L^\star },\beta ^R).$$
By re-writing the payoff functions \(u^L\) and \(u^R\) and replacing \(\beta _j\) we have:
$$u^L(\beta ^L,\beta ^R)=\frac{1}{2}\left( \beta _1^L+\beta _1^R\right) \sum _{x_i\in X,\, y_i=0}x_{ij}+\frac{n_0}{2}\left( \beta _0^L+\beta _0^R\right) +\Vert \beta ^L\Vert ^2$$
and
$$u^R(\beta ^L,\beta ^R)=-\frac{1}{2}\left( \beta _1^L+\beta _1^R\right) \sum _{x_i\in X,\, y_i=1}x_{ij}-\frac{n_1}{2}\left( \beta _0^L+\beta _0^R\right) +\Vert \beta ^R\Vert ^2,$$
where \(n_0\) represents the number of instances in X with label 0, and \(n_1\) the number of instances with label 1, respectively.
If we start with the left node and write the partial derivatives with respect to \(\beta _0^L\) and \(\beta _1^L\) we get:
$$\frac{\partial u^L}{\partial \beta _0^L}=\frac{n_0}{2}+2\beta _0^L$$
and
$$\frac{\partial u^L}{\partial \beta _1^L}=\frac{1}{2}\sum _{x_i\in X,\, y_i=0}x_{ij}+2\beta _1^L.$$
It follows that the optimal (minimum) \(\beta ^{L^\star }\) is:
$$\beta ^{L^\star }=\left( -\frac{n_0}{4},\; -\frac{1}{4}\sum _{x_i\in X,\, y_i=0}x_{ij}\right) .$$
In a similar manner, for the right node we have \(\beta ^{R^\star }\):
$$\beta ^{R^\star }=\left( \frac{n_1}{4},\; \frac{1}{4}\sum _{x_i\in X,\, y_i=1}x_{ij}\right) .$$
Thus, the NE of game \(\Gamma (X,y\mid j)\) is \((\beta ^{L^\star },\beta ^{R^\star })\). It follows that \(\beta _j^\star =(\beta _{j,0}^\star ,\beta _{j,1}^\star )\), computed as \(\beta _j^\star =\frac{1}{2}(\beta ^{L^\star }+\beta ^{R^\star })\), has the components:
$$\beta _{j,0}^\star =\frac{n_1-n_0}{8},\qquad \beta _{j,1}^\star =\frac{1}{8}\left( \sum _{x_i\in X,\, y_i=1}x_{ij}-\sum _{x_i\in X,\, y_i=0}x_{ij}\right) .$$
The game \(\Gamma (X,y\mid j)\) provides the equilibrium parameter \(\beta _j^\star \) that is used to separate node data: instead of splitting node data based on \(X_j\), the split is based on \(X_j\beta _{j,1}^\star +\beta _{j,0}^\star \), i.e., on values shifted according to the equilibrium of the game. This means that there is no other strategy \(\beta ^{L}\) or \(\beta ^{R}\) that would lead to a better payoff for either player while the other maintains its strategy unchanged.
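For illustration, the closed form above can be transcribed into a short Python sketch that also checks the no-deviation property of the equilibrium numerically; the function names (payoff_L, payoff_R, nash_beta) and the random test data are illustrative and not part of the NE-DT implementation.

```python
import numpy as np

def payoff_L(beta_L, beta_R, xj, y):
    """Left player's payoff u^L from (1): sum of shifted values of the
    label-0 instances plus the squared norm of the player's own strategy."""
    beta = 0.5 * (beta_L + beta_R)        # beta_j = (beta^L + beta^R) / 2
    shifted = xj * beta[1] + beta[0]      # x_ij * beta_{j,1} + beta_{j,0}
    return shifted[y == 0].sum() + np.dot(beta_L, beta_L)

def payoff_R(beta_L, beta_R, xj, y):
    """Right player's payoff u^R from (1) (note the minus sign)."""
    beta = 0.5 * (beta_L + beta_R)
    shifted = xj * beta[1] + beta[0]
    return -shifted[y == 1].sum() + np.dot(beta_R, beta_R)

def nash_beta(xj, y):
    """Closed-form Nash equilibrium of the splitting game for one attribute,
    obtained by setting the partial derivatives of (1) to zero (each payoff
    is a strictly convex quadratic in the player's own strategy)."""
    n0, n1 = np.sum(y == 0), np.sum(y == 1)
    s0, s1 = xj[y == 0].sum(), xj[y == 1].sum()
    beta_L = np.array([-n0 / 4.0, -s0 / 4.0])   # (beta^L_0, beta^L_1)
    beta_R = np.array([n1 / 4.0, s1 / 4.0])     # (beta^R_0, beta^R_1)
    return beta_L, beta_R, 0.5 * (beta_L + beta_R)

# Numerical check of the equilibrium property: random unilateral deviations
# should never decrease the deviating player's payoff.
rng = np.random.default_rng(0)
xj, y = rng.normal(size=50), rng.integers(0, 2, size=50)
bL, bR, bj = nash_beta(xj, y)
for _ in range(1000):
    dL = bL + rng.normal(scale=0.1, size=2)
    dR = bR + rng.normal(scale=0.1, size=2)
    assert payoff_L(dL, bR, xj, y) >= payoff_L(bL, bR, xj, y) - 1e-9
    assert payoff_R(bL, dR, xj, y) >= payoff_R(bL, bR, xj, y) - 1e-9
```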
Splitting hyper-plane
The equilibrium of game \(\Gamma (X, y\mid j)\) is used to shift instances with different labels away from each other. Therefore, in order to construct the splitting hyperplane for a node with data X, y based on attribute j, we first compute the parameter \(\beta _j\) as the equilibrium of \(\Gamma (X,y\mid j)\). Using \(\beta _j\), we associate to each instance \(x_i\) the corresponding value \(x_{ij}\beta _{j,1}+\beta _{j,0}\). Let
$$\tilde{X}_j^0=\{x_{ij}\beta _{j,1}+\beta _{j,0} \mid x_i\in X,\, y_i=0\} \quad \text {and}\quad \tilde{X}_j^1=\{x_{ij}\beta _{j,1}+\beta _{j,0} \mid x_i\in X,\, y_i=1\}$$
denote the sets of shifted values corresponding to the two labels.
The equilibrium of game \(\Gamma (X,y\mid j)\), \(\beta _j\), ensures that the sum of elements of \(\tilde{X}_j^0\) is minimized and the sum of elements of \(\tilde{X}_j^1\) is maximized, as the aim of the game is to separate the products of form \(x_{ij}\beta _{j,1}+\beta _{j,0}\) of instances with different labels.
In order to split the node data based on \(X_j\) and \(\beta _j\), we use as splitting point the average of a representative point from \(\tilde{X}_j^0\) and \(\tilde{X}_j^1\), respectively. In order to eliminate the effect of outliers in the average, percentiles are used: the \(k^{th}\) percentile, \(P_k(\tilde{X}_j^0)\), for \(\tilde{X}_j^0\), and the \((1-k)^{th}\) percentile, \(P_{(1-k)}(\tilde{X}_j^1)\), for \(\tilde{X}_j^1\), respectively. The splitting point \(\tilde{\beta }_j\) is computed as:
$$\tilde{\beta }_j=\frac{1}{2}\left( P_k(\tilde{X}_j^0)+P_{(1-k)}(\tilde{X}_j^1)\right) ,$$
and by using \(\beta _j\) and \(\tilde{\beta }_j\) the splitting (axis-parallel) hyper-plane can be defined as
$$X_j\beta _{j,1}+\beta _{j,0}=\tilde{\beta }_j.$$
The data X is then split based on attribute j: instances whose shifted value \(x_{ij}\beta _{j,1}+\beta _{j,0}\) falls on one side of \(\tilde{\beta }_j\) form the data of the left sub-node, \(X^L_j\), and the remaining instances form the data of the right sub-node, \(X^R_j\), with the corresponding sets of labels \(y^L_j\) and \(y^R_j\).
Example 1
Consider attribute \(X_j\) (\(j=1\)) of a data set with 50 instances and two labels (0 and 1), represented in Fig. 1, left, and generated by using the function make_classification from the scikit-learn (version 1.1.1) Python package, with a class separation parameter of 0.5. The right figure represents the two shifted sets \(\tilde{X}_j^0\) and \(\tilde{X}_j^1\), as well as the corresponding splitting point, illustrating the way the NE of the game, \(\beta \), and \(\tilde{\beta }\) shift data instances and determine the equilibrium-based splitting hyperplane.
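To make this concrete, the following Python sketch computes the shifted values, the percentile-based splitting point, and the resulting node split; the function name split_node and the convention that instances with shifted values below \(\tilde{\beta }_j\) go to the left sub-node are illustrative assumptions (the node rule in the paper fixes the exact inequality direction).

```python
import numpy as np

def split_node(xj, y, beta_j, kappa=0.2):
    """Split a node on one attribute using the equilibrium parameter
    beta_j = (beta_{j,0}, beta_{j,1}) and the percentile kappa.
    Returns a boolean mask selecting the left sub-node and the split point."""
    shifted = xj * beta_j[1] + beta_j[0]   # x_ij * beta_{j,1} + beta_{j,0}
    x0 = shifted[y == 0]                   # shifted set for label 0
    x1 = shifted[y == 1]                   # shifted set for label 1
    # Percentile-based representative points; numpy expects a 0-100 scale.
    p0 = np.percentile(x0, 100 * kappa)
    p1 = np.percentile(x1, 100 * (1 - kappa))
    beta_tilde = 0.5 * (p0 + p1)           # splitting point
    left_mask = shifted <= beta_tilde      # assumed split convention
    return left_mask, beta_tilde

# Usage with the equilibrium parameter from the previous sketch:
# beta_j = nash_beta(xj, y)[2]
# left, beta_tilde = split_node(xj, y, beta_j)
# xj_left, y_left, xj_right, y_right = xj[left], y[left], xj[~left], y[~left]
```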
Choosing the splitting attribute
In order to choose the attribute based on which the data in the node is split, we compute \(\beta _j\), \(\tilde{\beta }_j\), and the corresponding sets \(X_j^L, y_j^L\) and \(X_j^R, y_j^R\) for each attribute \(j=\overline{1,d}\). Any criterion for evaluating splits, e.g., entropy, gini, etc., can be used to compute the quality C(X, y, j) of the split, and the best attribute \(j^*\) is chosen based on its value.
GameSplit node data
The GameSplit algorithm for computing the splitting hyperplane of a node is outlined in Algorithm 1. For each attribute j, we compute \(\beta _j\) and \(\tilde{\beta }_j\) using (6) and (8), respectively, followed by the sets \(X_j^L, y_j^L\) and \(X_j^R, y_j^R\). The attribute that best splits the data according to a usual splitting criterion, with the parameters determined as above, is chosen for the split. The rule of the node is given by attribute \(j^*\) and the splitting parameters \(\beta _{j^*}\) and \(\tilde{\beta }_{j^*}\), following (9).
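A compact, self-contained sketch of this step is given below; it repeats the closed-form equilibrium and the percentile split point from the previous sketches and scores candidate attributes with the gini impurity. It is an illustrative reconstruction, not the authors' reference implementation of Algorithm 1.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a set of binary labels."""
    if len(labels) == 0:
        return 0.0
    p = np.mean(labels)
    return 2 * p * (1 - p)

def game_split(X, y, kappa=0.2):
    """Return (impurity, j*, beta_{j*}, beta_tilde_{j*}) for the attribute
    whose game-based split has the lowest weighted gini impurity, or None.
    Assumes the node is not pure (both labels are present)."""
    n, d = X.shape
    best = None
    for j in range(d):
        xj = X[:, j]
        # Closed-form equilibrium of the splitting game for attribute j.
        n0, n1 = np.sum(y == 0), np.sum(y == 1)
        s0, s1 = xj[y == 0].sum(), xj[y == 1].sum()
        beta_j = np.array([(n1 - n0) / 8.0, (s1 - s0) / 8.0])
        shifted = xj * beta_j[1] + beta_j[0]
        # Percentile-based splitting point.
        p0 = np.percentile(shifted[y == 0], 100 * kappa)
        p1 = np.percentile(shifted[y == 1], 100 * (1 - kappa))
        beta_tilde = 0.5 * (p0 + p1)
        left = shifted <= beta_tilde        # assumed split convention
        if left.all() or (~left).all():
            continue                        # degenerate split, skip attribute
        score = (left.sum() * gini(y[left]) + (~left).sum() * gini(y[~left])) / n
        if best is None or score < best[0]:
            best = (score, j, beta_j, beta_tilde)
    return best
```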
3.2 Nash equilibria - decision tree
The Nash Equilibria Decision Tree is constructed by using a top-down approach. The root node starts with the initial set of data and the tree is built recursively. The outline of building a NE-DT is presented in Algorithm 2. At each node level, if the node does not become a leaf, the splitting hyperplane is computed using the GameSplit procedure in Algorithm 1. A node becomes a leaf if: (i) it has only one element, or (ii) the maximum depth of the tree has been reached, or (iii) it is pure, i.e., all data instances have the same class (Algorithm 2, line 3).
NE-DT uses two parameters: MaxDepth, the maximum tree depth, and the percentile \(\kappa \), used to decide what proportion of data is left out when computing the splitting point in GameSplit (line 6, Algorithm 1). NE-DT starts with the entire data set at the root node, i.e., \(depth=0\); parameters \(\kappa \) and MaxDepth are problem dependent.
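The recursion of Algorithm 2 can be sketched as follows; build_tree receives the node-splitting routine as an argument (for instance, the game_split sketch above), and the class and function names are illustrative.

```python
import numpy as np

class Node:
    """A tree node: either a leaf storing the class-1 proportion, or an
    internal node storing a splitting rule and two children."""
    def __init__(self, proba=None, rule=None, left=None, right=None):
        self.proba, self.rule, self.left, self.right = proba, rule, left, right

def build_tree(X, y, split_fn, depth=0, max_depth=10, kappa=0.2):
    """Top-down recursive induction of the tree (sketch of Algorithm 2)."""
    # Leaf conditions: single instance, maximum depth reached, or pure node.
    if len(y) <= 1 or depth >= max_depth or len(np.unique(y)) == 1:
        return Node(proba=float(np.mean(y)))
    best = split_fn(X, y, kappa)            # e.g. game_split from the sketch above
    if best is None:                        # no usable split was found
        return Node(proba=float(np.mean(y)))
    _, j, beta_j, beta_tilde = best
    left = X[:, j] * beta_j[1] + beta_j[0] <= beta_tilde
    return Node(rule=(j, beta_j, beta_tilde),
                left=build_tree(X[left], y[left], split_fn, depth + 1, max_depth, kappa),
                right=build_tree(X[~left], y[~left], split_fn, depth + 1, max_depth, kappa))

def predict_proba(node, x):
    """Estimated probability of class 1 for a single instance x."""
    while node.proba is None:
        j, beta_j, beta_tilde = node.rule
        node = node.left if x[j] * beta_j[1] + beta_j[0] <= beta_tilde else node.right
    return node.proba
```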
4 Numerical experiments
The numerical experiments illustrate the performance of NE-DT on various synthetic and real-world benchmarks, compared to other decision tree-based approaches.
4.1 Experimental set-up
Two sets of benchmarks are used to evaluate the performance of NE-DT: a set of synthetically generated datasets for which the degree of difficulty can be set and a set of real-world benchmarks from the UCI Machine Learning Repository [1].
Synthetic data
The synthetic data sets were generated considering different numbers of instances, attributes, and degrees of overlap for the two classes. To ensure reproducibility and control over the generation of the test data, we used the make_classification function from the scikit-learn (version 1.1.1) Python library [75]. The following parameters are combined to generate the data: the number of instances (100, 250, 500, 1000, 1500, 2000, 5000, 7500, 10000), the number of features/attributes (2, 3, 5, 10, 20, 30, 50), the number of classes (2), the class separator (0.1, 0.2, 0.5, 1), the weight of each class (0.5), and the seed (500). The combination of these parameters results in 252 different synthetic data sets with variable degrees of difficulty. The class separator parameter controls the degree of overlap of the classes, with lower values generating more difficult data sets.
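The grid of synthetic data sets can be reproduced along the following lines; make_classification also requires the numbers of informative and redundant features, which are not reported above, so treating all features as informative and none as redundant is an assumption of this sketch.

```python
from itertools import product
from sklearn.datasets import make_classification

n_samples_grid = [100, 250, 500, 1000, 1500, 2000, 5000, 7500, 10000]
n_features_grid = [2, 3, 5, 10, 20, 30, 50]
class_sep_grid = [0.1, 0.2, 0.5, 1]

datasets = {}
for n, d, sep in product(n_samples_grid, n_features_grid, class_sep_grid):
    X, y = make_classification(
        n_samples=n,
        n_features=d,
        n_informative=d,      # assumption: not reported in the paper
        n_redundant=0,        # assumption: not reported in the paper
        n_classes=2,
        weights=[0.5],        # balanced classes
        class_sep=sep,        # smaller values -> more overlap, harder problem
        random_state=500,     # seed used in the paper
    )
    datasets[(n, d, sep)] = (X, y)

print(len(datasets))  # 9 * 7 * 4 = 252 synthetic data sets
```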
Real world benchmarks
The real-world data sets, requiring binary classification, were taken from the UCI Machine Learning Repository [1]. We use the following data sets: the iris data set (R1), from which we removed the setosa instances in order to obtain a linearly non-separable binary classification problem, the Pima Indians Diabetes data set (R2) with 768 instances and 8 attributes (available at https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database, last accessed Oct 2024), the Connectionist Bench (Sonar, Mines vs. Rocks) data set (R3) with 208 instances and 60 attributes, the banknote authentication data set (R4) with 1372 instances and 4 attributes, the Somerville Happiness Survey data set (R5) with 143 instances and 7 attributes, and Haberman's Survival data set (R6) with 306 instances and 3 attributes. The difficulty of the classification is illustrated by known results reported by state-of-the-art methods; we expect a data set for which smaller AUC values are reported in the literature to be more difficult.
Comparison with other methods
We compare the results of NE-DT with other tree-based classifiers in order to place the performance of NE-DT among other similar classification models. Thus, we use several variants of a Decision Tree (DT), with different hyperparameter values [20], and an Oblique Decision Tree [35], which uses oblique hyperplanes to split node data.
Performance evaluation
To evaluate the expected prediction error, we use the stratified k-fold cross-validation strategy [19]. Each data set is split into k equal-sized folds; \(k-1\) of them are used to fit/train the model, and the \(k^{th}\) to test/predict. We repeat this procedure k times, so that each fold is used exactly once for testing, and evaluate the results over the k predictions. Each fold is roughly balanced, i.e., it contains approximately the same number of instances from each class.
Three performance metrics are used to evaluate the results reported by the models for each test fold: the AUC (area under the ROC curve) [76], the \(F_1\) score [19], and the accuracy (ACC). The AUC can be viewed as the probability that a randomly selected positive sample is ranked higher than a randomly selected negative sample [77]. The \(F_1\) score is a popular metric for binary classification; it is the harmonic mean of the precision and recall metrics. The \(F_1\) score is independent of the number of samples correctly classified as negative and changes if the classes are swapped. The ACC measures the fraction of correctly identified samples. The three metrics take values between 0 and 1, they all indicate the best result at 1, and they can be used to compare the quality of the results. The Pima diabetes data set is slightly unbalanced (0.35/0.65); however, since we use the same performance metrics for all methods, we can still compare the results without considering separate indicators.
Thus, each data set is divided into k folds and the values of the three performance metrics for each test fold are recorded. The splitting process is repeated several times with different random seeds in order to gather enough data for comparisons. The synthetic data sets are then grouped to be analyzed based on different criteria (number of instances, attributes, etc.).
Parameter settings
NE-DT uses two parameters: the maximum tree depth MaxDepth and the percentile \(\kappa \) used to split data. We present the behavior of NE-DT when using a MaxDepth of 5 and 10, \(\kappa =0.2\), and both gini and entropy as split criteria. The effect of \(\kappa \) on the results is further investigated by testing values from 0 to 0.9 with a step of 0.1.
For the other classifiers, we tested different hyper-parameter values. For the DT both gini and entropy are tested as split criteria, and the maximum tree depth is set to 0, 5 and 10. A depth of 0 means that the tree splits the data until all the leaves are pure (contain instances of only one class). For the oblique DT the maximum depth is set to 5 and 10.
For each data set we used stratified 10-fold cross-validation (\(k=10\) folds) and repeated the process 12 times using different random seeds. The StratifiedKFold class from scikit-learn was used with the following seeds: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, and 60. Thus, for each dataset, we trained/tested 120 folds.
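The evaluation protocol can be sketched as follows, with scikit-learn's DecisionTreeClassifier used only as a stand-in model (NE-DT or the oblique tree would be plugged in the same way); the data set below is a placeholder, while the seeds and fold counts are those listed above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score, f1_score, accuracy_score

X, y = make_classification(n_samples=500, n_features=10, random_state=500)
seeds = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 60]     # 12 repetitions

scores = {"auc": [], "f1": [], "acc": []}
for seed in seeds:
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
    for train_idx, test_idx in cv.split(X, y):
        model = DecisionTreeClassifier(max_depth=10, criterion="gini")
        model.fit(X[train_idx], y[train_idx])
        proba = model.predict_proba(X[test_idx])[:, 1]
        pred = model.predict(X[test_idx])
        scores["auc"].append(roc_auc_score(y[test_idx], proba))
        scores["f1"].append(f1_score(y[test_idx], pred))
        scores["acc"].append(accuracy_score(y[test_idx], pred))

# 12 repetitions x 10 folds = 120 values per metric for this data set
print({m: (np.mean(v), np.std(v)) for m, v in scores.items()})
```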
Statistical comparisons
The statistical analysis of the results is conducted in two manners: (i) by comparing results reported on individual folds for all data sets; (ii) by comparing averaged results for each dataset.
In the first case, data sets are compared based on the values reported on each fold (120 values for each dataset). In this case, mostly due to size, we found that we can consider the data to be normally distributed, so we can use a one-way ANOVA test for the comparison of means. If differences in means between populations are found, we follow with a Tukey post-hoc test with letters to rank results. The Tukey post-hoc test assigns a letter to each method by comparing means and ranking them in groups that are significantly different, starting with the letter “a” for the group reporting the best results (highest indicator value); an algorithm may belong to several groups. We also report the number of times NE-DT results were assigned to each letter/group category.
In the second approach, we average indicator values for each data set and compare the results (all methods on all data sets). We find that in this situation we cannot consider values to be normally distributed. The Friedman test for medians with a post-hoc Nemenyi analysis where applicable was used, as recommended in [78].
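Both statistical pipelines can be sketched with standard Python packages: scipy provides the one-way ANOVA and Friedman tests and statsmodels the Tukey post-hoc test (the Friedman/Nemenyi analysis is carried out in the paper with the autorank package); the indicator values below are random placeholders.

```python
import numpy as np
import pandas as pd
from scipy.stats import f_oneway, friedmanchisquare
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(1)
methods = ["NE-DT", "DT", "DT0", "ObliqueDT"]
# Placeholder: 120 per-fold AUC values for each method on one benchmark group.
fold_auc = {m: rng.uniform(0.7, 0.95, size=120) for m in methods}

# (i) Per-fold comparison: one-way ANOVA followed by a Tukey post-hoc test.
stat, p = f_oneway(*fold_auc.values())
if p < 0.05:
    values = np.concatenate(list(fold_auc.values()))
    groups = np.repeat(methods, 120)
    print(pairwise_tukeyhsd(values, groups, alpha=0.05))

# (ii) Per-dataset comparison: Friedman test on values averaged per data set
# (rows = data sets, columns = methods), followed in the paper by a Nemenyi
# post-hoc analysis.
per_dataset = pd.DataFrame(rng.uniform(0.7, 0.95, size=(252, len(methods))),
                           columns=methods)
stat, p = friedmanchisquare(*[per_dataset[m] for m in methods])
print("Friedman p-value:", p)
```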
4.2 Results
The numerical results are presented as: CDF plots to illustrate differences between methods; bar plots for the Tukey letters, counting the number of times NE-DT was assigned to each group; and Nemenyi post-hoc critical distance (CD) diagrams to rank methods over all datasets. Numerical values are presented for the real-world benchmarks and for the Nemenyi post-hoc analysis. The section ends with a discussion.
Synthetic benchmarks
Empirical cumulative distribution function (CDF) plots allow the visualization and comparison of indicator values grouped by different characteristics of the data sets. A CDF plot shows the proportion of elements that are less than or equal to a given value. Thus, when comparing CDF curves, the rightmost ones can be considered better, as this indicates a smaller probability of obtaining low values.
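An empirical CDF curve can be drawn directly from the sorted per-fold values, as in the following matplotlib sketch; the AUC values used here are random placeholders.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_ecdf(values, label):
    """Plot the empirical CDF: the proportion of values <= each observed value."""
    xs = np.sort(values)
    ys = np.arange(1, len(xs) + 1) / len(xs)
    plt.step(xs, ys, where="post", label=label)

rng = np.random.default_rng(0)
plot_ecdf(rng.uniform(0.75, 0.95, 120), "NE-DT")   # placeholder AUC values
plot_ecdf(rng.uniform(0.70, 0.92, 120), "DT")
plt.xlabel("AUC")
plt.ylabel("proportion of folds")
plt.legend()
plt.show()
```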
Figure 2 compares the results using CDF plots of AUC values obtained on all test folds for all synthetic data sets. Each sub-figure presents distributions for a different value of the parameters used to generate the synthetic data sets. The results reported by all tested variants of a classifier using different combinations of hyperparameters are aggregated, resulting in four CDF curves: NE-DT (results obtained by all NE-DT classifiers, i.e., all combinations of parameters for NE-DT), DT0 (the DT that splits nodes until leaves are pure and contain only one class), DT, and Oblique DT. By aggregating results we can observe the overall performance of a classifier type and better evaluate NE-DT on data sets that present various degrees of difficulty. For example, the sub-figure with title no. instances = 1500 presents the results obtained on each test fold for the synthetic data sets that have 1500 instances and all combinations of the parameters number of features (2, 3, 5, 10, 20, 30, 50) and class separator (0.1, 0.2, 0.5, 1). For this sub-figure, we can observe that our approach, NE-DT, reports better results, as the CDF plot for NE-DT is the rightmost plot (for each point of the plot we have a higher probability of obtaining a higher AUC value with the NE-DT classifier than with the other compared classifiers). Figure 2 shows that NE-DT reports better results than the other classifiers for the different parameters of the generated data sets.
To assess the significance of differences among results, we used the one-way ANOVA test to check for differences between means. Since we found that differences exist, we continued the analysis with the Tukey post-hoc test with letters. The results are presented in Fig. 3. Each row contains results related to a different characteristic of the data set (as indicated by the label on the x axis). Each column of figures presents the number of times that NE-DT is placed in the corresponding Tukey group, with the leftmost column, letter “a”, indicating the group with the best results. Each individual figure presents the results for all three reported metrics and counts the number of times NE-DT is assigned the Tukey letter corresponding to the column; results are grouped by the parameter of the data set indicated by the row. For example, the figure in the first row and column (the upper left figure) counts how many times NE-DT is assigned the Tukey letter “a” when we group the results based on the number of instances of the data sets. Figure 3 shows that NE-DT is most often placed in group “a”, indicating that it achieves results better than or as good as the other methods in most cases.
NE-DT parameters
Parameter \(\kappa \) is used to compute the split point \(\tilde{\beta }_j\) and controls the proportion of data that is left out when computing it (8). In order to evaluate the robustness of NE-DT with respect to \(\kappa \), numerical experiments with values ranging from 0 to 0.9 were performed. Figure 4 presents the ECDF plots of the AUC values obtained. It can be observed that when no percentile is used, i.e., when \(\kappa =0\), we have the worst results in most cases. However, for values different from zero the NE-DT results are very similar, indicating robustness with respect to this parameter. Although there is no significant difference between the other settings of \(\kappa \) (apart from 0), it is reasonable to set it so that it leaves out \(20\%\) of the values for both labels, avoiding \(\tilde{\beta }_j\) being affected by extreme values on either side.
Real-world data sets
Table 1 presents the results reported by NE-DT for the real-world data sets. We report the mean and standard deviation of all three metrics for different parameters of NE-DT (maximum tree depth and split criterion). We also report the Tukey letter/rank of NE-DT when we compare its results, for each metric, to the other classifiers. If we look at the AUC indicator, NE-DT reports results that are statistically better than (or as good as) the other classifiers for the data sets R1, R3, R4, R5, and R6 for a maximum tree depth of 10, with no difference in results between the two split criteria. In terms of the \(F_1\) indicator, overall, the proposed approach seems to provide better results when the maximum depth is set to 10. For the R3 and R4 data sets, NE-DT reports statistically better results than the compared classifiers. When we look at the accuracy indicator, a maximum depth of 5 yields better results for the R2 data set, and for the other data sets all parameters produce similar results. For all data sets tested, except R6, NE-DT reports significantly better results.
Overall comparisons
In order to compare and rank all methods over the entire benchmark sets, the autorank Python package [79] was used, for each indicator and separately for the synthetic and real-world benchmarks. The following encoding for the method names was used to report the results (Table 2):
Table 3 presents the summary of results on the synthetic data for each indicator. For each method, we present the median (MED), the median absolute deviation (MAD), and the mean rank (MR), as well as the confidence interval for the medians, for each of the three indicators, in order to also illustrate differences among metrics. The methods are ranked in descending order, i.e., the last one ranks the best. Figures 5, 6, and 7 present the CD diagrams illustrating the results of the Nemenyi post-hoc test for the reported indicators.
With regard to the AUC indicator, the Nemenyi post-hoc test places the methods in the following groups, in decreasing order of their ranking (Fig. 5):
6. DT (5, gini or entropy);
5. DT (5, gini) and Oblique DT (5 or 10);
4. Oblique DT (5 or 10) and DT (0, gini);
3. Oblique DT (5), DT (0, gini), and DT (10, entropy);
2. NE-DT (10, gini or entropy), DT (0, gini or entropy), and DT (10, gini or entropy);
1. NE-DT (5, gini or entropy).
The number next to each type of tree represents the maximum depth.
Figure 6 indicates the following groups with no significant differences for the \(F_1\) indicator, in the same decreasing order of ranking:
6. NE-DT (5, gini or entropy) and DT (5, gini or entropy);
5. NE-DT (5, gini or entropy), NE-DT (10, gini), and DT (5, gini);
4. NE-DT (10, gini or entropy) and Oblique DT (5 or 10);
3. Oblique DT (5 or 10), DT (10, entropy), and DT (0, gini);
2. DT (10, gini or entropy) and DT (0, gini);
1. DT (0, gini or entropy) and DT (10, gini).
We find that with regard to this indicator, NE-DT results are not as competitive as in the case of the AUC; however, the average median values range from 0.671 to 0.714, so this may be a situation in which, while the differences are statistically significant, the actual numerical differences are not remarkable.
Figure 7 represents the corresponding diagram for the ACC values, with the following groups, placing NE-DT results in the middle:
5. DT (5, gini or entropy);
4. NE-DT (5, gini or entropy), NE-DT (10, gini), and DT (5, gini);
3. NE-DT (5 or 10, gini or entropy) and Oblique DT (5 or 10);
2. Oblique DT (5) and DT (0, gini);
1. DT (0, gini or entropy) and DT (10, gini or entropy).
For the real-world datasets we found that the hypothesis of normality cannot be rejected and that the data is homoscedastic for all indicators. Therefore ANOVA was used again, and we found that there is no significant difference among means when comparing all datasets with all methods (\(p=0.597\) for the AUC indicator and \(p=0.386\) for ACC). There was a significant difference in means for the \(F_1\) indicator (\(p=0.026\)), so the test was followed by a Tukey post-hoc analysis, which indicated that there are no significant differences within the following groups:
2. NE-DT (5 or 10, gini or entropy), DT (0, gini or entropy), and DT (10, gini);
1. NE-DT (5 or 10, gini), NE-DT (5, entropy), DT (0, 5 or 10, gini or entropy), and Oblique DT (5 or 10).
These results are consistent with those presented in Table 1 that show NE-DT results being assigned mainly letters “a” and “b”.
Discussion
The main goal of our proposal is to explore the use of the Nash equilibrium as a solution concept for classification. NE-DT constructs a decision tree that splits data using this concept. The numerical experiments section compares NE-DT results with other decision trees in order to assess whether its performance is competitive and whether it may be considered for real-world applications in which the decision tree variants we compare against are candidate tools.
We compared results on the two sets of benchmarks both over all tested folds and over data aggregated for each dataset. The reason for this approach was to provide a comprehensive view of the results. For the fold data we constructed CDF plots that allow the comparison of the distributions of performance indicators. When comparing the performance of NE-DT with the other decision trees on the synthetic data sets, we found the AUC curves of NE-DT to indicate overall higher values. The ANOVA test showed that there are indeed significant differences in results, and a post-hoc Tukey test with letters assigned NE-DT results to the group reporting the best results in the majority of cases (Fig. 3). In most of the cases where NE-DT was not placed in group “a”, it appeared in group “b”. By studying the distribution of results, we can assess that the three metrics used are consistent with each other, with small exceptions.
Regarding the number of instances, we find that the performance of NE-DT slightly decreases with increasing values, with more results in letter group “b” for higher numbers of instances. Regarding the number of attributes, the most interesting aspect is that NE-DT reported the worst results for the smallest number of attributes (2) and better results for the largest (50). This suggests that the equilibrium approach performs better when there is a wider choice of attributes. Results grouped by class overlap indicate a disadvantage of NE-DT: it seems to perform better on difficult problems with a higher degree of overlap (a smaller class separation parameter for generating data), but gives less precise results on better-separated classes. An explanation for this may be overfitting, as the maximum-depth row shows that results using a maximum depth of 10 are significantly worse. There is no significant difference between the two criteria for choosing the splitting attribute, gini and entropy.
The Friedman test, comparing results over all the synthetic datasets, ranks NE-DT best with respect to the AUC score (Fig. 5). However, as far as the \(F_1\) and accuracy scores are concerned, the results reported by NE-DT rank in the middle of the groups. The tests on the real-world benchmark results show no significant difference among methods for AUC and ACC; the \(F_1\) score test places NE-DT results in two significantly different groups, one of them ranking best. Regarding the NE-DT parameters specific to decision trees, we find that they influence results in the expected manner, i.e., trees with a higher maximum depth tend to overfit, and there are no notable differences between the two splitting criteria, gini and entropy.
Thus, as a conclusion regarding the analyzed data sets and performance indicators, we can assert that the AUC values reported by NE-DT are in most cases better or at least as good as the other tree-based methods with which it was compared. The accuracy values also ranked among the best, while the \(F_1\) score placed NE-DT in the middle. These results indicate that NE-DT would be useful in a setting in which the value of AUC is of importance, i.e., the values of the probabilities for positive instances have practical meaning and are used further.
Regarding the computational time required to run the experiments, there is no significant difference between NE-DT and the other base DT methods that use axis-parallel hyperplanes for splitting node data.
5 Conclusions and further research
The main goal of this study was to explore the use of the Nash equilibrium concept as a solution to the classification problem. Choosing the split parameters for a node within a decision tree represents a decision-making problem that requires an output suitable for predictions. In most situations there are infinitely many possible solutions that lead to the same split, i.e., the same data split in the sub-nodes and subsequently the same values for indicators such as gini or entropy. We propose choosing a point that is a Nash equilibrium, which may provide additional properties while representing a valid solution for the classification problem. The split is chosen in such a manner that no unilateral deviation to either side of the hyperplane would provide a better separation of the node data.
Thus, NE-DT is a decision tree that splits data based on the equilibrium of a two-player game. Designed for the binary classification problem, the equilibrium spreads the instances with different labels, making them easier to separate. The equilibria-based DT variant is tested on a set of synthetic and real-world data, and the results are compared with those of other decision trees that use the same quality indicators to choose splitting attributes for the nodes.
One of the advantages of the game-based proposal is that the equilibrium of the node game is computed analytically, making it computationally competitive with other decision trees. The node-splitting game also takes into account, within the payoff functions, the location of the data and not only class proportions, making the method more data-oriented.
There are several research directions that may stem from this approach. A next task that could improve the classification is to design a game for selecting attributes at node level using an equilibrium concept. It is known that the greedy choice of an attribute at the lower levels of the tree may not lead to optimal results. In this direction, several paths may be explored. The authors envisage at least two concrete ones: (i) design a game at node level, assigning a payoff function to each attribute based on the equilibrium of the node game for splitting data; (ii) design an extensive form game to tackle attribute choices over the entire tree. Another possible approach to the attribute selection problem is to explore designing game-based oblique decision trees, in which the hyperplane parameters represent game equilibria. The efficiency of the computation also makes it possible to explore the use of boosting and bagging techniques based on the equilibrium data split. Moreover, the use of the Nash equilibrium concept is not limited to decision trees and may be extended to other models. Considering other solution concepts, e.g., the strong Nash equilibrium or the generalized Nash equilibrium, is also a path worth exploring for future applications.
Availability of data and materials
The synthetic data was generated by using the freely available make_classification function from the scikit-learn (version 1.1.1) Python library, and all parameters used to generate it (including seeds) are reported in the paper for reproducibility reasons. They are also included in the attached code. Real-world data are downloaded from the public domain and are available online in the UCI Machine Learning Repository [1].
Code availability
The code can be made available on a public repository.
References
Dua D, Graff C (2017) UCI Machine Learning Repository. http://archive.ics.uci.edu/ml
Lee KC, Roy SS, Samui P, Kumar V (2020) Data Analytics in Biomedical Engineering and Healthcare. Academic Press, London, UK. https://doi.org/10.1016/C2018-0-05371-2
Knecht V (2022) AI for Physics. Taylor & Francis, Boca Raton, FL, pp 1–147
Pyzer-Knapp EO, Laino T (2020) Machine Learning in Chemistry: Data-Driven Algorithms, Learning Systems, and Predictions. ACS Symposium Series. Am Chem Soc, Washington, DC. https://doi.org/10.1021/bk-2019-1326
Valiya Veettil A, Mishra AK (2023) Quantifying thresholds for advancing impact-based drought assessment using classification and regression tree (cart) models. J Hydrol 129966. https://doi.org/10.1016/j.jhydrol.2023.129966
Dixon MF, Halperin I, Bilokon P (2020) Machine Learning in Finance, p 548. Springer, Cham. https://doi.org/10.1007/978-3-030-41068-1
Amaturo E, Aragona B (2019) Methods for big data in social sciences. Math Popul Stud 26(2):65–68. https://doi.org/10.1080/08898480.2019.1597577
Zaki MJ, Meira W Jr (2020) Data Mining and Machine Learning: Fundamental Concepts and Algorithms, 2nd edn. Cambridge University Press, New York. https://doi.org/10.1017/9781108564175
Wu X, Kumar V, Quinlan JR, Ghosh J, Yang Q, Motoda H, McLachlan GJ, Ng A, Liu B, Yu PS, Zhou ZH (2008) Top 10 algorithms in data mining. Knowl Inf Syst 14(1):1–37. https://doi.org/10.1007/s10115-007-0114-2. Accessed 21 Nov 2021
Breiman L, Friedman JH, Olshen RA, Stone CJ (2017) Classification And Regression Trees. Chapman and Hall/CRC
Rifkin R, Klautau A (2004) In defense of one-vs-all classification. J Mach Learn Res 5:101–141
Ma Z, Wang X, Hao Y (2023) Development and application of a hybrid forecasting framework based on improved extreme learning machine for enterprise financing risk. Expert Syst Appl 215:119373. https://doi.org/10.1016/j.eswa.2022.119373
Stolnicu S, Hoang L, Almadani N, De Brot L, Baiocchi G, Bovolim G, Brito MJ, Karpathiou G, Ieni A, Guerra E, Kiyokawa T, Dundr P, Parra-Herran C, Lérias S, Felix A, Roma A, Pesci A, Oliva E, Park KJ, Soslow RA, Abu-Rustum NR (2022) Clinical correlation of lymphovascular invasion and silva pattern of invasion in early-stage endocervical adenocarcinoma: proposed binary silva classification system. Pathol 54(5):548–554. https://doi.org/10.1016/j.pathol.2022.01.007
Micsonai A, Moussong É, Murvai N, Tantos Á, Toke O, Réfrégiers M, Wien F, Kardos J (2023) Disordered-ordered protein binary classification by circular dichroism spectroscopy. Biophys J 122(3, Supplement 1):344. https://doi.org/10.1016/j.bpj.2022.11.1915
Naem AA, Ghali NI, Saleh AA (2018) Antlion optimization and boosting classifier for spam email detection. Futur Comput Inf J 3(2):436–442. https://doi.org/10.1016/j.fcij.2018.11.006
Kumari R, Srivastava SK (2017) Machine learning: A review on binary classification. Int J Comput App 160(7)
Maschler M, Zamir S, Solan E (2020) Game Theory, 2nd edn. Cambridge University Press, New York. https://doi.org/10.1017/9781108636049
Rifkin R, Klautau A (2004) In defense of one-vs-all classification. J Mach Learn Res 5:101–141
Hastie T, Tibshirani R, Friedman J (2009) The Elements of Statistical Learning: Data Mining, Inference and Prediction, 2nd edn. Springer, New York
Breiman L, Friedman JH, Olshen RA, Stone CJ (1984) Classification and Regression Trees. Wadsworth and Brooks, Monterey, CA
Sagi O, Rokach L (2021) Approximating XGBoost with an interpretable decision tree. Inf Sci 572:522–542. https://doi.org/10.1016/j.ins.2021.05.055
Yoo J, Sael L (2021) Gaussian soft decision trees for interpretable feature-based classification. In: Karlapalem K, Cheng H, Ramakrishnan N, Agrawal RK, Reddy PK, Srivastava J, Chakraborty T (eds) Advances in Knowledge Discovery and Data Mining, pp 143–155. Springer, Cham. https://doi.org/10.1007/978-3-030-75765-6_12
Singh Hada S, Carreira-Perpinan MA (2022) Interpretable image classification using sparse oblique decision trees. In: ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp 2759–2763. https://doi.org/10.1109/ICASSP43922.2022.9747873
Pagliarini G, Sciavicco G (2023) Interpretable land cover classification with modal decision trees. Eur J Remote Sens 56(1). https://doi.org/10.1080/22797254.2023.2262738
Zografos M, Ougiaroglou S (2024) Simplifying decision tree classification through the autodtrees web application and service. In: Sifaleras A, Lin F (eds) Generative Intelligence and Intelligent Tutoring Systems. Springer, Cham, pp 162–173
Rokach L, Maimon O (2014) Data Mining With Decision Trees: Theory and Applications, 2nd edn. World Scientific Publishing Co., Inc, USA
Chikalov I, Hussain S, Moshkov M (2018) Bi-criteria optimization of decision trees with applications to data analysis. Eur J Oper Res 266(2):689–701. https://doi.org/10.1016/j.ejor.2017.10.021
Chabbouh M, Bechikh S, Hung CC, Said LB (2019) Multi-objective evolution of oblique decision trees for imbalanced data binary classification. Swarm Evol Comput 49:1–22. https://doi.org/10.1016/j.swevo.2019.05.005
Segatori A, Marcelloni F, Pedrycz W (2018) On Distributed Fuzzy Decision Trees for Big Data. IEEE Trans Fuzzy Syst 26(1):174–192. https://doi.org/10.1109/TFUZZ.2016.2646746
Rokach L, Maimon O (2005) Top-down induction of decision trees classifiers - a survey. IEEE Trans Syst Man Cybern Part C (Appl Rev) 35(4):476–487. https://doi.org/10.1109/TSMCC.2004.843247
Barros RC, Jaskowiak PA, Cerri R, de Carvalho ACPLF (2014) A framework for bottom-up induction of oblique decision trees. Neurocomput 135:3–12. https://doi.org/10.1016/j.neucom.2013.01.067
Gu C, Zhang B, Wan X, Huang M, Zou G (2016) The modularity-based hierarchical tree algorithm for multi-class classification. In: 2016 17th IEEE/ACIS International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing (SNPD), pp 625–629. https://doi.org/10.1109/SNPD.2016.7515969
Bertsimas D, Dunn J (2017) Optimal classification trees. Mach Learn 106(7):1039–1082. https://doi.org/10.1007/s10994-017-5633-9
Murthy SK, Kasif S, Salzberg S (1994) A system for induction of oblique decision trees. J Artif Intell Res 2:1–32
Wickramarachchi DC, Robertson BL, Reale M, Price CJ, Brown J (2016) Hhcart: An oblique decision tree. Comput Stat Data Anal 96:12–23. https://doi.org/10.1016/j.csda.2015.11.006
Armandpour M, Sadeghian A, Zhou M (2024) Convex polytope trees. In: Proceedings of the 35th International Conference on Neural Information Processing Systems. NIPS ’21. Curran Associates Inc., Red Hook, NY, USA
Li Y, Dong M, Kothari R (2005) Classifiability-based omnivariate decision trees. IEEE Trans Neural Netw 16(6):1547–1560
Xu Z, Zhu G, Yuan C, Huang Y (2022) One-Stage Tree: end-to-end tree builder and pruner. Mach Learn 111(5):1959–1985. https://doi.org/10.1007/s10994-021-06094-4
Irsoy O, Yildiz OT, Alpaydin E (2014) Budding trees. In: Proceedings - International Conference on Pattern Recognition, pp 3582–3587. https://doi.org/10.1109/ICPR.2014.616
Loyola-Gonzalez O, Ramirez-Sayago E, Medina-Perez MA (2023) Towards improving decision tree induction by combining split evaluation measures. Knowl-Based Syst 277. https://doi.org/10.1016/j.knosys.2023.110832
Zhao X, Nie X (2021) Splitting Choice and Computational Complexity Analysis of Decision Trees. Entropy 23(10). https://doi.org/10.3390/e23101241
Amro A, Al-Akhras M, Hindi KE, Habib M, Shawar BA (2021) Instance Reduction for Avoiding Overfitting in Decision Trees. J Intell Syst 30(1):438–459. https://doi.org/10.1515/jisys-2020-0061. Accessed 09 Jul 2022
Wieczorek W, Kozak J, Strak L, Nowakowski A (2021) Minimum Query Set for Decision Tree Construction. Entropy 23(12). https://doi.org/10.3390/e23121682
Yu J, Qiao Z, Tang W, Wang D, Cao X (2021) Blockchain-Based Decision Tree Classification in Distributed Networks. Intell Autom Soft Comput 29(3):713–728. https://doi.org/10.32604/iasc.2021.017154
Canete-Sifuentes L, Monroy R, Medina-Perez MA (2022) FT4cip: A new functional tree for classification in class imbalance problems. Knowl-Based Syst 252. https://doi.org/10.1016/j.knosys.2022.109294
Ovchinnik S, Otero F, Freitas AA (2022) Nested trees for longitudinal classification. In: Proceedings of the 37th ACM/SIGAPP Symposium on Applied Computing. SAC ’22, pp 441–444. Association for Computing Machinery, New York, NY, USA. https://doi.org/10.1145/3477314.3507240
He Z, Wu Z, Xu G, Liu Y, Zou Q (2023) Decision Tree for Sequences. IEEE Trans Knowl Data Eng 35(1):251–263. https://doi.org/10.1109/TKDE.2021.3075023
Lu S, Braunstein SL (2014) Quantum decision tree classifier. Quantum Inf Process 13(3):757–770. https://doi.org/10.1007/s11128-013-0687-5
Meng L, Bai B, Zhang W, Liu L, Zhang C (2023) Research on a Decision Tree Classification Algorithm Based on Granular Matrices. Electronics 12(21). https://doi.org/10.3390/electronics12214470
Marudi M, Ben-Gal I, Singer G (2024) A decision tree-based method for ordinal classification problems. IISE Trans 56(9, SI):960–974. https://doi.org/10.1080/24725854.2022.2081745
Blanquero R, Carrizosa E, Molero-Rio C, Morales DR (2021) Optimal randomized classification trees. Comput Oper Res 132:105281. https://doi.org/10.1016/j.cor.2021.105281
Patel KK, Desaulniers G, Lodi A (2024) An improved column-generation-based matheuristic for learning classification trees. Comput Oper Res 165:106579. https://doi.org/10.1016/j.cor.2024.106579
Demirović E, Lukina A, Hebrard E, Chan J, Bailey J, Leckie C, Ramamohanarao K, Stuckey PJ (2022) MurTree: Optimal decision trees via dynamic programming and search. J Mach Learn Res 23(26):1–47
Ducange P, Marcelloni F, Pecori R (2021) Fuzzy Hoeffding Decision Tree for Data Stream Classification. Int J Comput Intell Syst 14(1):946–964. https://doi.org/10.2991/ijcis.d.210212.001
Han X, Zhu X, Pedrycz W, Li Z (2023) A three-way classification with fuzzy decision trees. Appl Soft Comput 132:109788. https://doi.org/10.1016/j.asoc.2022.109788
Wu H, Zhu L (2024) Adaptive classification method of electronic music based on improved decision tree. Int J Arts Technol 15(1). https://doi.org/10.1504/IJART.2024.137296
Chelouah R, Siarry P (2022) Optimization and Machine Learning: Optimization for Machine Learning and Machine Learning for Optimization. John Wiley & Sons, London, UK
Turkoglu B, Uymaz SA, Kaya E (2022) Binary artificial algae algorithm for feature selection. Appl Soft Comput 120:108630. https://doi.org/10.1016/j.asoc.2022.108630
Turkoglu B, Uymaz SA, Kaya E (2022) Clustering analysis through artificial algae algorithm. Int J Mach Learn Cybern 13(4):1179–1196. https://doi.org/10.1007/s13042-022-01518-6. Accessed 23 Oct 2024
Turkoglu B, Kaya E (2020) Training multi-layer perceptron with artificial algae algorithm. Eng Sci Technol Int J 23(6):1342–1350. https://doi.org/10.1016/j.jestch.2020.07.001
Turkoglu B, Uymaz SA, Kaya E (2023) Chapter 1 - chaos theory in metaheuristics. In: Mirjalili S, Gandomi AH (eds) Comprehensive Metaheuristics, pp 1–20. Academic Press, London, UK. https://doi.org/10.1016/B978-0-323-91781-0.00001-6
Koçer HG, Türkoğlu B, Uymaz SA (2023) Chaotic golden ratio guided local search for big data optimization. Eng Sci Technol Int J 41:101388. https://doi.org/10.1016/j.jestch.2023.101388
Nash JF (1950) Equilibrium points in n-person games. Proc Natl Acad Sci 36(1):48–49. https://doi.org/10.1073/pnas.36.1.48. Accessed 02 Aug 2022
Rezek I, Leslie DS, Reece S, Roberts SJ, Rogers A, Dash RK, Jennings NR (2008) On similarities between inference in game theory and machine learning. J Artif Intell Res 33(1):259–283
Couellan N (2017) A note on supervised classification and nash-equilibrium problems. RAIRO - Oper Res 51(2):329–341. https://doi.org/10.1051/ro/2016024
Park H, Turaga DS, Verscheure O, Van Der Schaar M (2009) Tree Configuration Games for Distributed Stream Mining Systems. In: 2009 IEEE International Conference on Acoustics, Speech and Signal Processing, pp 1773–1776. https://doi.org/10.1109/ICASSP.2009.4959948
Berthod M, Kato Z, Yu S, Zerubia J (1996) Bayesian image classification using Markov random fields. Image Vis Comput 14(4):285–295. https://doi.org/10.1016/0262-8856(95)01072-6
Suciu MA, Lung RI (2020) Nash equilibrium as a solution in supervised classification. Lect Notes Comput Sci 12269:539–551. https://doi.org/10.1007/978-3-030-58112-1_37
Dritsoula L, Loiseau P, Musacchio J (2017) A game-theoretic analysis of adversarial classification. IEEE Trans Inf Forensic Sec 12(12):3094–3109. https://doi.org/10.1109/TIFS.2017.2718494
Cheng Y, Fu H, Sun X (2021) Intrusion Detection Based on the Game Theory. https://doi.org/10.1145/3474198.3478267
Zou L, Li Y, Xu F (2020) An adversarial denoising convolutional neural network for fault diagnosis of rotating machinery under noisy environment and limited sample size case. Neurocomput 407:105–120. https://doi.org/10.1016/j.neucom.2020.04.074
Pal A, Vidal R (2020) A game theoretic analysis of additive adversarial attacks and defenses. In: Proceedings of the 34th International Conference on Neural Information Processing Systems. NIPS ’20. Curran Associates Inc., Red Hook, NY, USA
Addesso P, Cirillo M, Di Mauro M, Matta V (2020) Advoip: Adversarial detection of encrypted and concealed voip. IEEE Trans Inf Forensic Sec 15:943–958. https://doi.org/10.1109/TIFS.2019.2922398
Yasodharan S, Loiseau P (2019) Nonzero-sum adversarial hypothesis testing games. In: Proceedings of the 33rd International Conference on Neural Information Processing Systems. NIPS ’19. Curran Associates Inc., Red Hook, NY, USA
Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D, Brucher M, Perrot M, Duchesnay E (2011) Scikit-learn: Machine learning in Python. J Mach Learn Res 12:2825–2830
Fawcett T (2006) An introduction to ROC analysis. Pattern Recognit Lett 27(8):861–874. https://doi.org/10.1016/j.patrec.2005.10.010
Rosset S (2004) Model selection via the AUC. In: Proceedings of the Twenty-First International Conference on Machine Learning. ICML ’04, p 89. Association for Computing Machinery, New York, NY, USA. https://doi.org/10.1145/1015330.1015400
Demšar J (2006) Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res 7(1):1–30
Herbold S (2020) Autorank: A Python package for automated ranking of classifiers. J Open Source Softw 5(48):2173. https://doi.org/10.21105/joss.02173
Acknowledgements
This work was supported by a grant of the Romanian Ministry of Education and Research, CNCS - UEFISCDI, project number PN-III-P4-ID-PCE-2020-2360, within PNCDI III.
Funding
This work was supported by a grant of the Romanian Ministry of Education and Research, CNCS - UEFISCDI, project number PN-III-P4-ID-PCE-2020-2360, within PNCDI III.
Author information
Contributions
Both authors have contributed to all stages of this research in equal proportion.
Ethics declarations
Conflict of interest
The authors have no competing interests to declare that are relevant to the content of this article.
Ethics approval
Not applicable
Consent to participate
Not applicable
Consent for publication
Not applicable
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Suciu, M.A., Lung, R.I. A Nash equilibria decision tree for binary classification. Appl Intell 55, 192 (2025). https://doi.org/10.1007/s10489-024-06132-3