1 Introduction

The field of machine learning is continuously expanding, with new algorithms providing means of analyzing large amounts of complex data. The number of fields in which data analysis tools have become relevant, by revealing useful information and aiding decision-making, has also been increasing, largely due to the algorithms' capabilities and the available computational power. In particular, classification methods are used to solve problems in fields ranging from medicine [2], physics [3], chemistry [4], agriculture [5], and economics [6] to the social sciences [7]. Within real-world applications, variations of classical, textbook approaches [8, 9], sometimes tailored to the specificity of the field, are most frequently used. Decision trees (DT [10]) are among the most popular of these approaches. One of the main reasons for their popularity is their ease of use and inherent explainability: they offer rules that may further help explain the data. However, to improve their accuracy, they are often used as base classifiers in ensemble approaches such as boosting or bagging, thus losing their explainability appeal.

Binary classification, the problem of dividing data into two classes, is fundamental to machine learning. Binary classification problems are straightforward to understand and analyze. By focusing on just two classes, the complexity of the problem is reduced, making it easier to interpret the results and draw meaningful conclusions. It also serves as the foundation for more complex classification tasks like multi-class classification. Many algorithms and techniques used in binary classification can be extended to handle multiple classes [11]. Many real-world problems are formulated as binary classification in different fields such as finance [12], healthcare [13], biochemistry [14], cyber-security [15], and others [16]. Binary classification models also benefit from a clear evaluation framework, where performance metrics such as accuracy, precision, recall, F1-score, and area under the receiver operating characteristic curve (AUC-ROC) can be easily calculated and understood, a crucial feature for real-world applications.

A new approach to solving the binary classification problem based on decision trees and game theory is proposed in this paper. A decision tree recursively splits data into regions. In the tree representation, each node corresponds to a region in the data space, and a splitting mechanism is used to divide the data into two sub-regions corresponding to the two sub-nodes. The splitting mechanism used defines the regions and the flavor of the tree. This paper proposes a novel splitting mechanism for node data based on the Nash equilibrium concept [17]. A game between the two classes is designed, with the purpose of identifying an equilibrium hyperplane to separate the data. The Nash equilibrium is thus used as an alternative to the optimization solutions generally used in decision tree induction. The main advantage of using the NE is stability: players who choose the NE have no unilateral incentive to deviate. A classification based on the NE can be as good as or even better than one based on optimization. It can also provide a new kind of trade-off that may have beneficial practical implications.

The paper is organized as follows: Section 2 presents general approaches to the classification problem with a focus on decision trees and existing game-theoretic approaches. Section 3 describes the proposed approach, the Nash Equilibria Decision Tree (NE-DT). Numerical experiments presented in Section 4 illustrate the behavior of NE-DT on a set of synthetic and real-world benchmarks for classification. The paper ends with conclusions and further work (Section 5).

2 Related work

The classification problem consists of finding rules for assigning labels, or classes, to data based on existing information provided in a training data set. If only two classes are considered, the problem is called binary and is one of the most important instances of the classification problem, offering a multitude of theoretical opportunities and applications. Multiple solutions exist for extending an algorithm for binary classification to a multi-class one; for example, a straightforward solution consists of using the one-versus-all approach [18].

Formally, we are given a set of data instances \(\mathcal {X}\subset \mathbb {R}^{n\times d}\). Each instance x in \(\mathcal {X}\) has a label \(y\in \{0,1\}\), with the vector of all labels belonging to \(\mathcal {Y}\subset \{0,1\}^n\). We want to find a rule \(\mathcal {R}\) that assigns to each instance \(x \in \mathcal {X}\) a label \( \hat{y} \in \{0,1\}\) as close as possible to its real one \(y\). Then, we assume we can use the rule \(\mathcal {R}\) to assign labels to other instances that come from the same source as those in \(\mathcal {X}\).

The solution of the binary classification problem can have various representations [19]. In many models, the rule \(\mathcal {R}\) provides the means to estimate the probability that an instance belongs to one of the classes. Decision trees [20] generate the classification rule by recursively splitting the available data using hyper-planes until regions containing instances having (almost) the same label are delineated. The tree representation is considered intuitive and useful for illustrating the classification rule. The interpretability aspect plays an important role in using decision trees in real-world applications [21,22,23,24]. Because of that, a tool for making DTs usable by non-specialists was proposed in [25].

Decision trees are usually built recursively by choosing at each node the attribute(s) based on which the node data is best split. Hence, most decision trees use an optimization process during their induction. Whether we are talking about splitting data within a DT node or in general, a good solution requires a trade-off between precision and predictability, i.e., between performance at the training and test levels. Tackling this problem from the optimization point of view has been approached in many ways, for example by constructing and combining different performance indicators [26], by using multi-objective approaches [27, 28], or by including fuzzy concepts [29].

When considering the design and induction of decision trees, there are several criteria that can be used to classify approaches. One of them is the manner in which the tree is induced, i.e., top-down [8, 30] or bottom-up [31, 32]. In the classic, and probably most popular, top-down approach, which is the one that we use in this paper, the tree induction starts with the entire data set in the root node. Bottom-up approaches usually employ some clustering or community structure method to create leaves and build the tree starting from them. On the other hand, optimal decision trees construct the entire tree as an optimization problem [33].

One of the most important features that characterizes a particular flavor of decision tree is the way it splits data at node level. Most decision trees use a hyperplane to split data. The first ones used axis-parallel hyper-planes, involving one attribute at each node level [20]. This attribute is chosen to maximize/minimize some purity indicator such as the entropy or the gini index [8]. A direct generalization was the introduction of oblique decision trees [34, 35], in which node-splitting hyper-planes use several attributes. In this case the induction is computationally more expensive, but oblique trees are generally more accurate than axis-parallel ones. Other splitting rules involve polytopes [36] or nonlinear splits [37]. Recently, soft approaches borrowed features from neural networks, such as back-propagation and the use of the sigmoid function to split node data [38, 39]. Another approach is to combine multiple evaluation measures [40].

Regardless of the type of split, a criterion is also required to choose among possible parameters. Usually, this criterion involves evaluating the purity of the split sets with respect to the labels; whether the approach is based on statistics or information theory, it essentially uses the proportion of instances with different labels in the split data [41]. A variation of the same purity criterion is also used to decide whether the data in a node is pure enough to require no further splitting, in which case the node becomes a leaf. A node may also become a leaf if a maximum allowed tree depth is reached. The tree size is also subject to decision making from the design point of view, as a trade-off between accuracy, computational complexity, and overfitting is desired [8, 42]. Last but not least, an important aspect is the method used to interpret data from leaves, i.e., the separate regions, in order to make predictions regarding the labels of new instances that belong to those regions. Most often, a probability based on the proportion of instances with each label in the leaf is computed.

Many other methods for tree construction have been proposed in the literature. A minimum query set that uses linear programming and a genetic algorithm is proposed in [43]. A blockchain-based ID3 decision tree classification framework for distributed networks is proposed in [44]. Functional trees use a linear combination of features in the nodes [45]. DTs are highly adaptable to different types of data. A nested trees approach for longitudinal data is proposed in [46]. A decision tree for sequences is designed in [47]. In [48] a quantum decision tree is proposed. A decision tree based on granular computing theory is presented in [49]. A decision tree for ordinal classification problems is designed in [50].

Optimal decision trees [33] convert the induction of the tree into a mixed-integer optimization problem. This method has been widely extended and used in various applications. The optimal randomized classification tree in [51] is based on continuous optimization using a random decision-making process at each node. A column-generation-based metaheuristic for learning classification trees is proposed in [52]. Dynamic programming is used for MurTree in [53].

Another category of widely used decision trees are based on fuzzy logic [29]. In [54] the authors extend a Hoeffding Decision Tree with fuzziness for data stream classification. Fuzzy decision trees are used in a three-way classification approach in [55]. A decision tree based on a fuzzy analytic hierarchy process is used for classification of electronic music in [56].

Optimization methods [57] play an important role in machine learning, and particularly in classification [58, 59]. As mentioned above, the induction of DTs often uses an underlying optimization method. Depending on the approach, exact methods or heuristics can be used. There are many examples of heuristic search in machine learning: training a multi-layer perceptron with the artificial algae algorithm [60], chaos theory in metaheuristics [61], and chaotic golden ratio guided local search for big data optimization [62]. The advantage of using heuristics for optimization in machine learning is that they usually do not require certain mathematical properties of the objective function, thus making them adaptable to various difficult practical applications. However, they are sometimes subject to theoretical scrutiny, as convergence is most often established only by statistical means. One of the advantages of the method proposed in this paper is that it does not depend on or require an optimization heuristic.

There is no doubt that the literature describing various types of decision trees for classification is rich in methods and applications, seemingly offering little room for innovation and improvement. However, the continuous need for new flavors of decision trees, not only for specific applications but also for classification in general, indicates that there is still a gap to be filled. Although most avenues have been thoroughly explored, given the complexity of the problem, exploring different solution concepts may still provide the decision maker with new perspectives on the solution space. Game theory, as a field, models strategic or conflicting situations, providing various solution concepts with a strong theoretical background. In this paper we aim to explore the use of the Nash equilibrium as a direct solution for binary classification.

Nash Equilibrium for classification

A normal form game is defined by a set of players, a set of actions available to each player, and a set of payoff functions. The payoff of each player depends on the actions of all players. The Nash equilibrium is defined as a situation of a game in which no player has incentives for unilateral deviation [63]. It represents a stable trade-off in conflicting situations, which makes it an ideal candidate for problems that combine multiple, contradictory objectives such as the classification problem. In spite of that, and of the fact that there is a large amount of work relating game theory and machine learning [64], there is very little research regarding the use of NE directly as a solution for classification.

An attempt to model the problem as a game, based on SVMs, is presented in [65]; it uses the generalized Nash equilibrium as a solution for binary and multi-class classification problems. Players representing different labels attempt to minimize their distance to the separating hyperplane. In [66], a tree configuration game is used in stream mining systems. Another application uses a game approach for image classification with a Markov random field model [67]. In [68] a normal form game among data instances is designed and used to estimate parameters of probabilistic binary classification models. Apart from these approaches, in most cases the NE, or other appropriate game-theoretic solution concepts, are used to process results and select solutions after a classification method has been applied.

Most such applications are found within the adversarial classification framework [69], in which usually two players are involved. There are two motivations for this: (i) the one mentioned before, that the properties of the NE are particularly suitable for addressing conflicting situations; and (ii) there are many methods available to compute or approximate the NE of a two-player game. The first motivation applies to any problem involving decision making among agents with different objectives. The second, related to the computational aspects of determining equilibria, is what usually limits the practical application of this concept on a wider scale.

Adversarial models vary widely in their applications. In intrusion detection, a game is designed between the intruder and the intrusion detection mechanism, represented by a binary classification method [70]. Another application, in fault diagnosis [71], uses the NE during the training of a convolutional neural network. In [72] the authors study the behavior of defenders and attackers under equilibrium; they show that the Fast Gradient Method for the attacker and Randomized Smoothing for the defender form an equilibrium that can be approximated by using a finite number of samples from the distribution. A game for VoIP detection is designed in [73]. The adversarial setting also allows theoretical studies of equilibrium [74]. However, this paper is not concerned with adversarial models, but with the use of the Nash equilibrium directly as a solution concept within the classification model. For the induction of decision trees, to the best of our knowledge, there have been no attempts to use the NE directly to split node data.

3 Nash equilibria - decision tree - NE-DT

The Nash equilibrium concept is used to compute the parameters of the splitting hyperplane at node level. A splitting game is designed between the two classes. The goal is to find hyperplane parameters that shift instances with one label as far as possible from instances with the other label, in order to make them easier to separate.

3.1 Splitting game

A two-player game \(\Gamma {(X,y \mid j)}\) is designed to split data X, having corresponding labels y, based on attribute j:

  • the players are represented by the two sub-nodes, left-L and right-R, respectively;

  • the strategy of each player is to propose a hyperplane parameter \(\beta ^L\) and \(\beta ^R\), respectively; \(\beta ^L, \beta ^R \in \mathbb {R}^2\)

  • the payoff to be minimized by each player is designed aiming to shift instances with different labels as far as possible from each other while searching for the lowest parameters possible:

    $$\begin{aligned} \begin{array}{lll} \displaystyle u^L(\beta ^L, \beta ^R)& = & \sum _{{x_{i}\in X, y_i=0}} (x_{ij}\beta _{j,1} + \beta _{j,0}) + \Vert \beta ^L\Vert ^2, \\ \displaystyle u^R (\beta ^L, \beta ^R)& =& - \sum _{{x_{i}\in X,y_i=1}} (x_{ij}\beta _{j,1}+ \beta _{j,0}) + \Vert \beta ^R\Vert ^2. \end{array} \end{aligned}$$
    (1)

    The value of \(\beta _j=(\beta _{j,0}, \beta _{j,1})\in \mathbb {R}^2\) is computed as

    $$\beta _j=\frac{1}{2}(\beta ^L+\beta ^R),$$

    and \(\Vert \beta \Vert ^2=\Vert (\beta _0,\beta _1)\Vert ^2=\beta _0^2+\beta _1^2\).

The payoffs of game \(\Gamma \) are designed in such a manner that the left player L proposes a strategy \(\beta ^L\) that minimizes the sum of the values \(x_{ij}\beta _{j,1}+\beta _{j,0}\) over all instances with label 0, while the right player R proposes a strategy \(\beta ^R\) that maximizes the corresponding sum over the instances with label 1. In effect, each player proposes a parameter that would move the data with one label as far away as possible from the data with the other label. A compromise is reached by taking the average of the two strategies as the hyperplane parameter.
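For concreteness, the payoffs in (1) can be expressed as a short Python sketch; the helper below is only illustrative, assuming X and y are NumPy arrays, j is the attribute index, and beta_L, beta_R are length-2 arrays (the function name and interface are ours, not part of the formal model):

```python
import numpy as np

def payoffs(beta_L, beta_R, X, y, j):
    """Payoffs u^L and u^R of the splitting game (1) for attribute j."""
    beta = 0.5 * (np.asarray(beta_L) + np.asarray(beta_R))  # beta_j = (beta^L + beta^R) / 2
    shift = X[:, j] * beta[1] + beta[0]                      # x_ij * beta_{j,1} + beta_{j,0}
    u_L = shift[y == 0].sum() + np.dot(beta_L, beta_L)       # sum over label 0   + ||beta^L||^2
    u_R = -shift[y == 1].sum() + np.dot(beta_R, beta_R)      # -(sum over label 1) + ||beta^R||^2
    return u_L, u_R
```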

Nash equilibrium

The Nash equilibrium [17, 63] is one of the most popular solution concepts in game theory, as it is designed to offer a type of stability that is appealing in practical, conflicting situations. A NE is a situation of the game, i.e., a choice of strategies for each player, such that none of them has a unilateral incentive for deviation. Thus, in a NE, none of the players can gain anything by changing their strategies while the others maintain theirs. The Nash equilibrium of game \(\Gamma \) is represented by strategies \(\beta ^L\) and \(\beta ^R\) such that there is no change in either of them that would lead to a better payoff while the other player maintains its choice.

The NE for game \(\Gamma (X,y\mid j)\) can be computed by solving the following optimization problems:

$$\begin{aligned} \min _{\beta ^L}u^L(\beta ^L,\beta ^R) \end{aligned}$$
(2)

and

$$\begin{aligned} \min _{\beta ^R}u^R(\beta ^L,\beta ^R) \end{aligned}$$
(3)

By re-writing the payoff functions \(u^L\) and \(u^R\) and replacing \(\beta _j\) we have:

$$\begin{aligned} u^L(\beta ^L,\beta ^R)=\frac{n_0}{2}(\beta _0^L+\beta _0^R) + \frac{1}{2} \sum _{\begin{array}{c} x_i\in X,\\ y_i=0 \end{array}}x_{ij}(\beta _1^L+\beta _1^R)+(\beta _1^L)^2+(\beta _0^L)^2, \end{aligned}$$

and

$$\begin{aligned} u^R(\beta ^L,\beta ^R)=-\frac{n_1}{2}(\beta _0^L+\beta _0^R) - \frac{1}{2} \sum _{\begin{array}{c} x_i\in X,\\ y_i=1 \end{array}}x_{ij}(\beta _1^L+\beta _1^R)+(\beta _1^R)^2+(\beta _0^R)^2, \end{aligned}$$

where \(n_0\) represents the number of instances in X with label 0, and \(n_1\) the number of instances with label 1, respectively.

If we start with the left node and write the partial derivatives with respect to \(\beta _0^L\) and \(\beta _1^L\) we get:

$$u_0^{L\,\prime }(\beta ^L, \beta ^R)=\frac{n_0}{2}+ 2\beta _0^L$$

and

$$u_1^{L\,\prime }(\beta ^L, \beta ^R)=\frac{1}{2}\sum _{\begin{array}{c} x_i\in X,\\ y_i=0 \end{array}}x_{ij}+2\beta _1^L.$$

It follows that the optimal (minimum) \(\beta ^{L^\star }\) is:

$$\begin{aligned} \beta _0^{L^\star }=-\frac{n_0}{4} \text {, and }\; \beta _1^{L^\star }=-\frac{1}{4}\sum _{\begin{array}{c} x_i\in X,\\ y_i=0 \end{array}}x_{ij} \end{aligned}$$
(4)

In a similar manner, for the right node we have \(\beta ^{R^\star }\):

$$\begin{aligned} \beta _0^{R^\star }=\frac{n_1}{4} \text {, and }\; \beta _1^{R^\star }=\frac{1}{4}\sum _{\begin{array}{c} x_i\in X,\\ y_i=1 \end{array}}x_{ij}. \end{aligned}$$
(5)
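The closed-form best responses (4)-(5) can be checked symbolically; the sketch below (our own verification, not part of the proposed algorithm) solves the first-order conditions of \(u^L\) and \(u^R\) with SymPy, treating \(n_0\), \(n_1\), and the attribute sums \(s_0=\sum _{y_i=0}x_{ij}\) and \(s_1=\sum _{y_i=1}x_{ij}\) as symbols:

```python
import sympy as sp

b0L, b1L, b0R, b1R = sp.symbols('beta0_L beta1_L beta0_R beta1_R')
n0, n1, s0, s1 = sp.symbols('n0 n1 s0 s1')  # class sizes and attribute sums

# payoffs with beta_j = (beta^L + beta^R)/2 already substituted, as in the rewritten u^L and u^R
uL = n0 / 2 * (b0L + b0R) + sp.Rational(1, 2) * s0 * (b1L + b1R) + b1L**2 + b0L**2
uR = -n1 / 2 * (b0L + b0R) - sp.Rational(1, 2) * s1 * (b1L + b1R) + b1R**2 + b0R**2

# each player minimizes its own payoff in its own variables: first-order conditions
best_L = sp.solve([sp.diff(uL, b0L), sp.diff(uL, b1L)], [b0L, b1L])
best_R = sp.solve([sp.diff(uR, b0R), sp.diff(uR, b1R)], [b0R, b1R])
print(best_L)  # {beta0_L: -n0/4, beta1_L: -s0/4}  -> equation (4)
print(best_R)  # {beta0_R: n1/4,  beta1_R: s1/4}   -> equation (5)
```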

Thus, the NE of game \(\Gamma (X,y\mid j)\) is \((\beta ^{L^\star },\beta ^{R^\star })\). It follows that \(\beta _j^\star =(\beta _{j,0}^\star ,\beta _{j,1}^\star )\), computed as \(\beta _j^\star =\frac{1}{2}(\beta ^{L^\star }+\beta ^{R^\star })\), has the components:

$$\begin{aligned} \begin{array}{cl} \beta _{j,0}^\star & =\displaystyle {\frac{1}{8}(n_1-n_0)}\\ \beta _{j,1}^\star & =\displaystyle {\frac{1}{8}\bigg (\sum _{\begin{array}{c} x_i\in X,\\ y_i=1 \end{array}}x_{ij} - \sum _{\begin{array}{c} x_i\in X,\\ y_i=0 \end{array}}x_{ij} \bigg )}= \frac{1}{8} \sum _{x_i \in X}(2x_{ij}y_i-x_{ij})\\ \end{array} \end{aligned}$$
(6)

The game \(\Gamma (X,y\mid j)\) provides the equilibrium parameter \(\beta _j^\star \) that is used to separate node data: instead of splitting node data based on \(X_j\), the split is based on \(X_j\beta _{j,1}^\star +\beta _{j,0}^\star \), i.e., on values shifted according to the equilibrium of the game. This means that there is no deviation from \(\beta ^{L^\star }\) or \(\beta ^{R^\star }\) that would lead to a better payoff for either player while the other maintains its strategy unchanged.
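In practice, (6) reduces the equilibrium parameter to two sums over the node data; a minimal vectorized sketch (array and function names are ours) is:

```python
import numpy as np

def nash_beta(X, y, j):
    """Equilibrium parameter beta_j^* of game Gamma(X, y | j), following equation (6)."""
    n0, n1 = np.sum(y == 0), np.sum(y == 1)
    beta_j0 = (n1 - n0) / 8.0
    beta_j1 = np.sum(2 * X[:, j] * y - X[:, j]) / 8.0  # (sum over label 1 - sum over label 0) / 8
    return beta_j0, beta_j1
```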

Splitting hyper-plane

The equilibrium of game \(\Gamma (X, y\mid j)\) is used to shift instances with different labels away from each other. Therefore, in order to construct the splitting hyperplane for a node with data (X, y) based on attribute j, we first compute the parameter \(\beta _j\) as the equilibrium of \(\Gamma (X,y\mid j)\). Using \(\beta _j\) we associate to each instance \(x_i\) the corresponding value \(x_{ij}\beta _{j,1}+\beta _{j,0}\). Let:

$$\begin{aligned} \begin{aligned} \tilde{X}_j^0&= \{ x_{ij}\beta _{j,1}+\beta _{j,0} \mid y_i=0\} \\ \tilde{X}_j^1&= \{ x_{ij}\beta _{j,1}+\beta _{j,0} \mid y_i=1\} \end{aligned} \end{aligned}$$
(7)

The equilibrium of game \(\Gamma (X,y\mid j)\), \(\beta _j\), ensures that the sum of elements of \(\tilde{X}_j^0\) is minimized and the sum of elements of \(\tilde{X}_j^1\) is maximized, as the aim of the game is to separate the products of form \(x_{ij}\beta _{j,1}+\beta _{j,0}\) of instances with different labels.

In order to split node data based on \(X_j\) and \(\beta _j\), we use as splitting point the average of a representative point from \(\tilde{X}_j^0\) and \(\tilde{X}_j^1\), respectively. In order to eliminate the effect of outliers in the average, percentiles are used: the \(k^{th}\) percentile, \(P_k(\tilde{X}_j^0)\), for \(\tilde{X}_j^0\), and the \((1-k)^{th}\) percentile, \(P_{(1-k)}(\tilde{X}_j^1)\), for \(\tilde{X}_j^1\), respectively. The splitting point \(\tilde{\beta }_j\) is computed as:

$$\begin{aligned} \tilde{\beta }_j=\frac{1}{2}\big (P_k(\tilde{X}_j^0)+P_{(1-k)}(\tilde{X}_j^1)\big ), \end{aligned}$$
(8)

and by using \(\beta _j\) and \(\tilde{\beta }_j\) the splitting (axis parallel) hyper-plane can be defined as

$$ \beta _{j,1} x + \beta _{j,0} = \tilde{\beta }_j.$$
Fig. 1
figure 1

Example 1, illustrating the way instances of an attribute are shifted using game \(\Gamma \), and the corresponding splitting point. The left image represents the original instances with their labels, and the right one the same instances, shifted by using the game \(\Gamma \) NE; the red line represents the splitting point

The rule for splitting the data in X, based on attribute j, into data for the left sub-node \(X^L_j\) and for the right sub-node \(X^R_j\), respectively, is:

$$\begin{aligned} \begin{aligned} X_j^L&=\{x_i\in X \mid \beta _{j,1} x_{ij} + \beta _{j,0} \le \tilde{\beta }_j \}, \\ X_j^R&=\{x_i\in X \mid \beta _{j,1} x_{ij} + \beta _{j,0} \ge \tilde{\beta }_j \}, \end{aligned} \end{aligned}$$
(9)

with the corresponding set of labels:

$$\begin{aligned} \begin{aligned} y_j^L&=\{y_i \mid \beta _{j,1} x_{ij} + \beta _{j,0} \le \tilde{\beta }_j \}, \\ y_j^R&=\{y_i \mid \beta _{j,1} x_{ij} + \beta _{j,0} \ge \tilde{\beta }_j \}. \end{aligned} \end{aligned}$$
(10)

Example 1

Consider attribute \(X_j\) (\(j=1\)) of a data set with 50 instances and two labels (0 and 1), represented in Fig. 1, left, and generated by using the function make_classification from the scikit-learn Python package, with a class separation parameter of 0.5. The right figure represents the two shifted sets \(\tilde{X}_j^0\) and \(\tilde{X}_j^1\), as well as the corresponding splitting point, illustrating the way the NE of the game, \(\beta \), and \(\tilde{\beta }\) shift the data instances and determine the equilibrium-based splitting hyperplane.
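A minimal sketch reproducing the computations of this example is given below; the attribute index, seed, and number of informative features are illustrative choices rather than values stated above, and \(\kappa \) is read as a proportion when passed to the percentile function:

```python
import numpy as np
from sklearn.datasets import make_classification

# 50 instances, two labels, weak class separation (class_sep=0.5), as in Example 1
X, y = make_classification(n_samples=50, n_features=2, n_informative=2,
                           n_redundant=0, class_sep=0.5, random_state=500)
j, kappa = 1, 0.2                                   # illustrative attribute index and percentile parameter
n0, n1 = np.sum(y == 0), np.sum(y == 1)
beta_0 = (n1 - n0) / 8.0                            # equation (6)
beta_1 = np.sum(2 * X[:, j] * y - X[:, j]) / 8.0
shifted = X[:, j] * beta_1 + beta_0                 # shifted values, equation (7)
split = 0.5 * (np.percentile(shifted[y == 0], 100 * kappa) +
               np.percentile(shifted[y == 1], 100 * (1 - kappa)))  # splitting point, equation (8)
print(beta_0, beta_1, split)
```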

Choosing the splitting attribute

In order to choose the attribute based on which the data in the node is split, we compute \(\beta _j\), \(\tilde{\beta }_j\), and the corresponding sets \(X_j^L, y_j^L\) and \(X_j^R, y_j^R\) for each attribute \(j=\overline{1,d}\). Any criterion for evaluating the splits, e.g., entropy, gini, etc., can be used to compute the quality \(C(X,y\mid j)\) of the split and choose the best attribute \(j^*\) to be used for the split based on its value.

GameSplit node data

The GameSplit algorithm for computing the splitting hyperplane for a node is outlined in Algorithm 1. For each attribute j, we compute \(\beta _j\) and \(\tilde{\beta }_j\) using (6) and (8), respectively, followed by the sets \(X_j^L, y_j^L\) and \(X_j^R, y_j^R\). The attribute that best splits the data, based on some usual splitting criterion, using parameters determined as above, is chosen for the split. The rule of the node is based on attribute \(j^*\) and splitting parameters \(\beta _{j^*}\) and \(\tilde{\beta }_{j^*}\) based on (9).

Algorithm 1
figure a

GameSplit(\(X,y,\kappa \)): split node data (X, y), with percentile \(\kappa \).
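A minimal Python sketch of the GameSplit procedure follows, assuming NumPy inputs and gini impurity as the split-quality criterion; the helper names and the tie-breaking are ours, and the exact criterion used in Algorithm 1 may differ:

```python
import numpy as np

def gini(labels):
    """Gini impurity of a binary label vector."""
    if len(labels) == 0:
        return 0.0
    p = np.mean(labels)
    return 2 * p * (1 - p)

def game_split(X, y, kappa=0.2):
    """Return the best attribute j*, beta_{j*}, and tilde_beta_{j*} for node data (X, y)."""
    n, d = X.shape
    n0, n1 = np.sum(y == 0), np.sum(y == 1)
    best = None
    for j in range(d):
        beta_0 = (n1 - n0) / 8.0                              # equation (6)
        beta_1 = np.sum(2 * X[:, j] * y - X[:, j]) / 8.0
        shifted = X[:, j] * beta_1 + beta_0                   # equation (7)
        tilde = 0.5 * (np.percentile(shifted[y == 0], 100 * kappa) +
                       np.percentile(shifted[y == 1], 100 * (1 - kappa)))  # equation (8)
        left = shifted <= tilde                               # equation (9)
        cost = (left.sum() * gini(y[left]) + (~left).sum() * gini(y[~left])) / n
        if best is None or cost < best[0]:
            best = (cost, j, (beta_0, beta_1), tilde)
    _, j_star, beta_star, tilde_star = best
    return j_star, beta_star, tilde_star
```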

3.2 Nash equilibria - decision tree

The Nash Equilibria Decision Tree is constructed using a top-down approach. The root node starts with the initial set of data and the tree is built recursively. The outline of building a NE-DT is presented in Algorithm 2. At each node, if the node does not become a leaf, the splitting hyperplane is computed using the GameSplit procedure in Algorithm 1. A node becomes a leaf if: (i) it has only one element, or (ii) the maximum depth of the tree has been reached, or (iii) it is pure, i.e., all data instances have the same class (Algorithm 2, line 3).

NE-DT uses two parameters: MaxDepth, the maximum tree depth, and the percentile \(\kappa \), used to decide what proportion of the data is left out when computing the splitting point in GameSplit (line 6, Algorithm 1). NE-DT starts with the entire data set at the root node, i.e., \(depth=0\); parameters \(\kappa \) and MaxDepth are problem dependent.

Algorithm 2
figure b

NE-DT (\(X,y; \kappa , depth, MaxDepth\)): Nash equilibria - Decision Tree Algorithm - outline.
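A compact recursive sketch of the NE-DT induction and of leaf-based prediction is given below; it reuses the game_split helper sketched above, and the dictionary representation, the degenerate-split safeguard, and the 0.5 default for empty leaves are assumptions of this sketch, not part of Algorithm 2:

```python
import numpy as np  # game_split as in the GameSplit sketch above

def build_ne_dt(X, y, kappa=0.2, depth=0, max_depth=5):
    """Recursively build a NE-DT node; leaves store the proportion of label-1 instances."""
    # leaf conditions: single instance, maximum depth reached, or pure node
    if len(y) <= 1 or depth >= max_depth or len(np.unique(y)) == 1:
        return {'leaf': True, 'p1': float(np.mean(y)) if len(y) else 0.5}
    j, (b0, b1), tilde = game_split(X, y, kappa)       # Algorithm 1
    left = X[:, j] * b1 + b0 <= tilde                  # equation (9)
    if left.all() or (~left).all():                    # safeguard: degenerate split -> leaf
        return {'leaf': True, 'p1': float(np.mean(y))}
    return {'leaf': False, 'j': j, 'b0': b0, 'b1': b1, 'tilde': tilde,
            'L': build_ne_dt(X[left], y[left], kappa, depth + 1, max_depth),
            'R': build_ne_dt(X[~left], y[~left], kappa, depth + 1, max_depth)}

def predict_proba_one(node, x):
    """Probability of label 1 for a single instance x."""
    while not node['leaf']:
        node = node['L'] if x[node['j']] * node['b1'] + node['b0'] <= node['tilde'] else node['R']
    return node['p1']
```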

4 Numerical experiments

The numerical experiments illustrate the performance of NE-DT on various synthetic and real-world benchmarks, compared to other decision tree-based approaches.

4.1 Experimental set-up

Two sets of benchmarks are used to evaluate the performance of NE-DT: a set of synthetically generated datasets for which the degree of difficulty can be set and a set of real-world benchmarks from the UCI Machine Learning Repository [1].

Synthetic data

The synthetic data sets were generated considering different numbers of instances, attributes, and degrees of overlap for the two classes. To ensure reproducibility and control over the generation of the test data we used the make_classification function from the scikit-learn Python library [75]. The following parameters are combined to generate the data: the number of instances (100, 250, 500, 1000, 1500, 2000, 5000, 7500, 10000), the number of features/attributes (2, 3, 5, 10, 20, 30, 50), the number of classes (2), the class separator (0.1, 0.2, 0.5, 1), the weight of each class (0.5), and the seed (500). The combination of these different parameters results in 252 different synthetic data sets with variable degrees of difficulty. The class separator parameter controls the degree of overlap of the classes, with lower values generating more difficult data sets.
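The generation of the 252 data sets can be sketched as below; the listed parameter values are taken from the description above, while the split into informative and redundant features is our assumption, since only the total number of features is specified:

```python
from itertools import product
from sklearn.datasets import make_classification

n_samples_list = [100, 250, 500, 1000, 1500, 2000, 5000, 7500, 10000]
n_features_list = [2, 3, 5, 10, 20, 30, 50]
class_sep_list = [0.1, 0.2, 0.5, 1]

datasets = {}
for n, d, sep in product(n_samples_list, n_features_list, class_sep_list):
    X, y = make_classification(n_samples=n, n_features=d,
                               n_informative=min(d, 2), n_redundant=0,  # assumed feature split
                               n_classes=2, weights=[0.5], class_sep=sep,
                               random_state=500)
    datasets[(n, d, sep)] = (X, y)
print(len(datasets))  # 9 * 7 * 4 = 252 data sets
```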

Real world benchmarks

The real-world data sets were taken from the UCI Machine Learning Repository [1] and require binary classification of the data. We use the following data sets: the iris data set (R1), from which we removed the setosa instances in order to obtain a linearly non-separable binary classification problem, the Pima Indians Diabetes data set (R2) with 768 instances and 8 attributes, the Connectionist Bench (Sonar, Mines vs. Rocks) data set (R3) with 208 instances and 60 attributes, the banknote authentication data set (R4) with 1372 instances and 4 attributes, the Somerville Happiness Survey data set (R5) with 143 instances and 7 attributes, and Haberman’s Survival data set (R6) with 306 instances and 3 attributes. The difficulty of the classification is illustrated by known results reported by state-of-the-art methods; we expect a data set for which smaller AUC values are reported in the literature to be more difficult.

Comparison with other methods

We compare the results of NE-DT with other tree-based classifiers in order to place the performance of NE-DT among other similar classification models. Thus, we use several variants of a Decision Tree (DT), with different hyperparameter values [20], and an Oblique Decision Tree [35], which uses oblique hyperplanes to split node data.

Performance evaluation

To evaluate the expected prediction error, we use the stratified k-fold cross-validation strategy [19]. Each data set is split into k equal-sized folds; \(k-1\) of them are used to fit/train the model, and the \(k^{th}\) to test/predict. We repeat this procedure k times, such that each of the k folds is used once as the test fold, and evaluate the results of the k predictions. Each fold is roughly balanced, i.e., it contains approximately the same number of instances from each class.

Three performance metrics are used to evaluate the results reported by the models for each test fold: the AUC (area under the ROC curve) [76], the \(F_1\) score [19], and the accuracy (ACC). The AUC can be viewed as the probability that a randomly selected positive sample is ranked higher than a randomly selected negative sample [77]. The \(F_1\) score, a popular metric for binary classification, is the harmonic mean of precision and recall. It is independent of the number of samples correctly classified as negative and changes if the classes are swapped. The ACC measures the fraction of correctly identified samples. The three metrics take values between 0 and 1, they all indicate the best result at 1, and they can be used to compare the quality of the results. The Pima diabetes data set is slightly unbalanced (0.35/0.65). However, since we use the same performance metrics for all methods, we can still compare the results without considering separate indicators.
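A sketch of the evaluation protocol is given below, assuming a scikit-learn-style classifier with fit and predict_proba methods (the function name and the 0.5 decision threshold are our assumptions):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score, f1_score, accuracy_score

def evaluate(clf, X, y, k=10, seed=1):
    """Stratified k-fold cross-validation; returns per-fold (AUC, F1, ACC) values."""
    skf = StratifiedKFold(n_splits=k, shuffle=True, random_state=seed)
    scores = []
    for train_idx, test_idx in skf.split(X, y):
        clf.fit(X[train_idx], y[train_idx])
        proba = clf.predict_proba(X[test_idx])[:, 1]   # probability of the positive class
        pred = (proba >= 0.5).astype(int)
        scores.append((roc_auc_score(y[test_idx], proba),
                       f1_score(y[test_idx], pred),
                       accuracy_score(y[test_idx], pred)))
    return np.array(scores)
```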

Thus, each data set is divided into k folds and the values of the three performance metrics for each test fold are recorded. The splitting process is repeated several times, with different random seeds, in order to gather enough data for comparisons. The synthetic data sets are then grouped for analysis based on different criteria (number of instances, attributes, etc.).

Parameter settings

NE-DT uses two parameters: the maximum tree depth MaxDepth and the percentile \(\kappa \) used to split data. We present the behavior of NE-DT for a MaxDepth of 5 and 10, \(\kappa =0.2\), and both gini and entropy as split criteria. The effect of \(\kappa \) on the results is further investigated by testing values from 0 to 0.9, with a step of 0.1.

Fig. 2
figure 2

CDF plots of the AUC values reported by each classifier, on each test fold, for the synthetic data. The results are grouped by characteristics of the data sets. For each classifier the results obtained with different hyper-parameters are aggregated (NE-DT, DT0, DT, and oblique DT), in order to observe the overall performance. The rightmost curves indicate better overall AUC values

For the other classifiers, we tested different hyper-parameter values. For the DT both gini and entropy are tested as split criteria, and the maximum tree depth is set to 0, 5 and 10. A depth of 0 means that the tree splits the data until all the leaves are pure (contain instances of only one class). For the oblique DT the maximum depth is set to 5 and 10.

For each data set we used stratified 10-fold cross-validation (\(k=10\) folds) and repeated the process 12 times using different random seeds. The StratifiedKFold function from scikit-learn was used with the following seeds: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, and 60. Thus, for each dataset, we trained/tested 120 folds.

Statistical comparisons

The statistical analysis of the results is conducted in two manners: (i) by comparing results reported on individual folds for all data sets; (ii) by comparing averaged results for each dataset.

Fig. 3
figure 3

Significance tests: Number of experiments in which NE-DT results were assigned each Tukey letter for the synthetic data sets. Letter “a” indicates that NE-DT results are as good as or better than the other methods in the same group. The one-way ANOVA and post-hoc Tukey test were performed for all synthetic data sets for the three metrics: AUC, \(F_1\), and ACC.

In the first case, data sets are compared based on the values reported on each fold (120 values for each dataset). In this case, mostly due to the sample size, we found that we can consider the data to be normally distributed, so we can use a one-way ANOVA test for the comparison of means. If differences in means between populations are found, we follow with a Tukey post-hoc test with letters to rank the results. The Tukey post-hoc test assigns a letter to each method by comparing means and ranking them in groups that are significantly different, starting with the letter “a” for the group reporting the best results (highest indicator value); an algorithm may belong to several groups. We also report the number of times NE-DT results were assigned to each letter/group category.
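A sketch of this per-fold comparison is shown below; the per-fold values are replaced here by illustrative random numbers (not results from the paper), and the compact-letter grouping is derived separately from the pairwise Tukey output:

```python
import numpy as np
from scipy.stats import f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(0)
# illustrative per-fold AUC values, 120 folds per method; real values come from the CV runs
groups = {'NE-DT': rng.normal(0.85, 0.03, 120),
          'DT': rng.normal(0.82, 0.03, 120),
          'DT0': rng.normal(0.80, 0.03, 120),
          'ObliqueDT': rng.normal(0.83, 0.03, 120)}

stat, p = f_oneway(*groups.values())  # one-way ANOVA for differences in means
if p < 0.05:
    values = np.concatenate(list(groups.values()))
    labels = np.concatenate([[name] * len(v) for name, v in groups.items()])
    # Tukey HSD post-hoc; the letter groups are built from these pairwise comparisons
    print(pairwise_tukeyhsd(values, labels, alpha=0.05))
```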

In the second approach, we average the indicator values for each data set and compare the results (all methods on all data sets). We find that in this situation we cannot consider the values to be normally distributed. The Friedman test for medians, followed by a post-hoc Nemenyi analysis where applicable, was used, as recommended in [78].

4.2 Results

The numerical results are presented as: CDF plots to illustrate differences between methods; bar plots of the Tukey letters, counting the number of times NE-DT was assigned to each group; and Nemenyi post-hoc critical distance (CD) diagrams to rank methods over all datasets. Numerical values are presented for the real-world benchmarks and for the Nemenyi post-hoc analysis. The section ends with a discussion.

Synthetic benchmarks

Empirical cumulative distribution (CDF) plots allow visualization and comparison of indicator values grouped by different characteristics of the data sets. A CDF plot shows the proportion of values that are less than or equal to a given point. Thus, when comparing CDF curves, the rightmost ones can be considered better, as they indicate a smaller probability of obtaining low values.
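Such a curve can be computed and drawn with a few lines; the sketch below is illustrative (values stands for the per-fold AUC scores of one classifier):

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_ecdf(values, label):
    """Empirical CDF: proportion of values less than or equal to each point."""
    x = np.sort(values)
    y = np.arange(1, len(x) + 1) / len(x)
    plt.step(x, y, where='post', label=label)

# e.g., plot_ecdf(auc_ne_dt, 'NE-DT'); plot_ecdf(auc_dt, 'DT'); plt.legend(); plt.show()
```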

Fig. 4
figure 4

ECDF plots of the AUC metric reported by NE-DT, on each test fold, for the synthetic data and for different values of the parameter \(\kappa \). The results are grouped by the parameters used to generate the data sets

Figure 2 compares the results using CDF plots of the AUC values obtained on all test folds for all synthetic data sets. Each sub-figure presents distributions for a different value of the parameters used to generate the synthetic data sets. The results reported by all tested variants of a classifier using different combinations of hyperparameters are aggregated, resulting in four CDF curves: NE-DT (results obtained by all NE-DT classifiers, i.e., all combinations of parameters for NE-DT), DT0 (the DT that splits nodes until leaves are pure and contain only one class), DT, and oblique DT. By aggregating results we can observe the overall performance of a classifier type and better evaluate NE-DT on data sets that present various degrees of difficulty. For example, the sub-figure with title no. instances = 1500 presents the results obtained on each test fold for the synthetic data sets that have 1500 instances and all combinations of the parameters number of features (2, 3, 5, 10, 20, 30, 50) and class separator (0.1, 0.2, 0.5, 1). For this sub-figure, we can observe that our approach, NE-DT, reports better results, as the CDF plot for NE-DT is the rightmost plot (for each point of the plot we have a higher probability of obtaining a higher AUC value with the NE-DT classifier than with the other compared classifiers). Figure 2 shows that NE-DT reports better results than the other classifiers for the different parameters of the generated data sets.

Table 1 NE-DT results for the real-world data sets: mean and standard deviation of the AUC, \(F_1\), and ACC metrics for the test folds, for each parameter setting of NE-DT, for each dataset, with the post-hoc Tukey letter of NE-DT in parenthesis

To assess the significance of differences among results, we used the one-way ANOVA test to check for differences between means. Since we found that differences exist, we continued the analysis with the Tukey post-hoc test with letters. The results are presented in Fig. 3. Each row contains results related to a different characteristic of the data set (as indicated by the label on the x axis). Each column of figures presents the number of times that NE-DT is placed in that Tukey group, with the leftmost, starting with letter “a”, indicating the group with the best results. Each individual figure presents the results for all three metrics and counts the number of times NE-DT is assigned the Tukey letter corresponding to the column; results are grouped by the parameter of the data set indicated by the row. For example, the figure in the first row and column (the upper left figure) counts how many times NE-DT is assigned the Tukey letter “a” if we group the results based on the number of instances of the data sets. Figure 3 shows that NE-DT is most often placed in group “a”, indicating that it achieves results as good as or better than the other methods in most cases.

NE-DT parameters

Parameter \(\kappa \) is used to compute the split point \(\tilde{\beta }_j\) and controls the proportion of data that is left out when computing it in (8). In order to evaluate the robustness of NE-DT with respect to \(\kappa \), numerical experiments with values ranging from 0 to 0.9 were performed. Figure 4 presents the ECDF plots of the AUC values obtained. It can be observed that when no percentile is used, i.e., when \(\kappa =0\), we obtain the worst results in most cases. However, for values different from zero the NE-DT results are very similar, indicating robustness with respect to this parameter. Although there is no significant difference between the other settings of \(\kappa \) (apart from 0), it is reasonable to set it so that it leaves out \(20\%\) of the values for both labels, avoiding \(\tilde{\beta }_j\) being affected by extreme values on either side.

Real-world data sets

Table 1 presents the results reported by NE-DT for the real-world data sets. We report the mean and standard deviation for all three metrics for different parameters of NE-DT (maximum tree depth and split criterion). We also report the Tukey letter/rank of NE-DT when we compare its results, for each metric, to the other classifiers. If we look at the AUC indicator, NE-DT reports results that are statistically better than (or as good as) the other classifiers for the data sets R1, R3, R4, R5, and R6 for a maximum tree depth of 10, with no difference in results between the two split criteria. In terms of the \(F_1\) indicator, overall, the proposed approach seems to provide better results when the maximum depth is set to 10. For the R3 and R4 data sets, NE-DT reports statistically better results than the compared classifiers. When we look at the accuracy indicator, a maximum depth of 5 yields better results for the R2 data set, and for the other data sets all parameter settings produce similar results. For all data tested, except R6, NE-DT reports significantly better results.

Overall comparisons

In order to compare and rank all methods over the entire benchmark sets, the autorank Python package [79] was used, for each indicator and separately for the synthetic and real-world benchmarks. The following encoding for the method names was used to report the results (Table 2):

Table 3 presents the summary of results for synthetic data for each indicator. For each method, we present the median (MED), the median absolute deviation (MAD), and the mean rank (MR), as well as the confidence interval for medians, for the three indicators to illustrate differences among metrics as well. The methods are ranked in descending order, i.e., the last one ranks the best. Figures 56, and 7 present the CD diagrams illustrating the results of the Nemenyi post-hoc test for the reported indicators.

Table 2 Methods used for comparison: the Code in the leftmost column is used for representing results
Table 3 Summary of results: MR (mean rank), MED (median), MAD (median absolute deviation), and CI (confidence interval) for each indicator and each method
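A sketch of this ranking step with the autorank package is given below; the DataFrame layout (one column per method, one row per data set, cells holding averaged indicator values) and the illustrative random values are our assumptions, and autorank itself selects the appropriate test, which for non-normal data is the Friedman test with a Nemenyi post-hoc:

```python
import numpy as np
import pandas as pd
from autorank import autorank, plot_stats

rng = np.random.default_rng(0)
method_codes = ['NE-DT-5-g', 'NE-DT-10-g', 'DT-5-g', 'DT-10-g', 'ODT-5']  # illustrative codes
# rows: data sets, columns: methods; cells: averaged AUC (or F1 / ACC) per data set
results = pd.DataFrame(rng.uniform(0.6, 0.95, size=(252, len(method_codes))),
                       columns=method_codes)

ranked = autorank(results, alpha=0.05, verbose=False)
print(ranked)        # median, MAD, mean rank, and confidence interval per method
plot_stats(ranked)   # CD diagram, as in Figs. 5, 6, and 7
```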

With regard to the AUC indicator, the Nemenyi post-hoc test places the methods in the following groups, in decreasing order of their ranking (Fig. 5):

  6. DT (5, gini or entropy);

  5. DT (5, gini) and Oblique DT (5 or 10);

  4. Oblique DT (5 or 10) and DT (0, gini);

  3. Oblique DT (5), DT (0, gini), and DT (10, entropy);

  2. NE-DT (10, gini or entropy), DT (0, gini or entropy), and DT (10, gini or entropy);

  1. NE-DT (5, gini or entropy).

The number next to each type of tree represents the maximum depth.

Figure 6 indicates the following groups with no significant differences for the \(F_1\) indicator, again in decreasing order of ranking:

  6. NE-DT (5, gini or entropy) and DT (5, gini or entropy);

  5. NE-DT (5, gini or entropy), NE-DT (10, gini), and DT (5, gini);

  4. NE-DT (10, gini or entropy) and Oblique DT (5 or 10);

  3. Oblique DT (5 or 10), DT (10, entropy), and DT (0, gini);

  2. DT (10, gini or entropy) and DT (0, gini);

  1. DT (0, gini or entropy) and DT (10, gini).

We find that, with regard to this indicator, NE-DT results are not as competitive as in the case of the AUC; however, the median values range from 0.671 to 0.714, so this may be a situation in which, although the differences are statistically significant, their actual magnitude is not remarkable.

Fig. 5
figure 5

CD diagram to visualize the results of the Nemenyi post-hoc test for the AUC indicator. The horizontal lines indicate that differences are not significant (\(CD=1.0497\))

Fig. 6
figure 6

CD diagram to visualize the results of the Nemenyi post-hoc test for the F1 indicator. The horizontal lines indicate that differences are not significant (CD = 1.0497)

Figure 7 represents the corresponding diagram for the ACC values, with the following groups, placing NE-DT results in the middle:

  5. DT (5, gini or entropy);

  4. NE-DT (5, gini or entropy), NE-DT (10, gini), and DT (5, gini);

  3. NE-DT (5 or 10, gini or entropy) and Oblique DT (5 or 10);

  2. Oblique DT (5) and DT (0, gini);

  1. DT (0, gini or entropy) and DT (10, gini or entropy).

Fig. 7
figure 7

CD diagram to visualize the results of the Nemenyi post-hoc test for the Accuracy indicator. The horizontal lines indicate that differences are not significant (CD = 1.0497)

For the real-world datasets we found that the hypothesis of normality cannot be rejected and that the data is homoscedastic for all indicators. Therefore, ANOVA was used again, and we found no significant difference among means when comparing all methods over all datasets (\(p=0.597\) for the AUC indicator and \(p=0.386\) for ACC). There was a significant difference in means for the \(F_1\) indicator (\(p=0.026\)), so the test was followed by a Tukey post-hoc analysis, which indicated that there are no significant differences within the following groups:

  2. NE-DT (5 or 10, gini or entropy), DT (0, gini or entropy), and DT (10, gini);

  1. NE-DT (5 or 10, gini), NE-DT (5, entropy), DT (0, 5 or 10, gini or entropy), and Oblique DT (5 or 10).

These results are consistent with those presented in Table 1 that show NE-DT results being assigned mainly letters “a” and “b”.

Discussion

The main goal of our proposal is to explore the use of the Nash equilibrium as a solution concept for classification. NE-DT constructs a decision tree that splits data using this concept. The numerical experiments section compares NE-DT results with those of other decision trees in order to assess whether its performance is competitive and whether it may be considered for real-world applications in which the decision tree variants used for comparison are candidate tools.

We compared results on the two sets of benchmarks both over all the tested folds and over data aggregated for each dataset. The reason for this approach was to provide a comprehensive view of the results. For the fold data we constructed CDF plots that allow the comparison of the distributions of performance indicators. When comparing the performance of NE-DT with the other decision trees on the synthetic data sets, we found the AUC curves of the NE-DT results to indicate overall higher values. The ANOVA statistical test showed that there are indeed significant differences in results, and a post-hoc Tukey test with letters assigned NE-DT results to the group reporting the best results in the majority of the cases (Fig. 3). In most of the cases where NE-DT was not placed in group “a”, it appeared in group “b”. By studying the distribution of results, we can assess that the three metrics used are consistent with each other, with small exceptions.

Regarding the number of instances, we find that the performance of NE-DT slightly decreases with increasing values, with more results in letter group “b” for higher numbers of instances. Regarding the number of attributes, the most interesting aspect is that NE-DT reported the worst results for the smallest number of attributes (2) and better results for the largest (50). This suggests that the equilibrium approach performs better when there is a wider choice of attributes. Results grouped by class overlap indicate a disadvantage of NE-DT: it seems to perform better on difficult problems with a higher degree of overlap (smaller class separation parameter when generating the data), but gives less precise results on better separated classes. An explanation for this may be overfitting, as the maximum depth row shows that results obtained with a maximum depth of 10 are significantly worse. There is no significant difference between the two criteria for choosing the splitting attribute, gini and entropy.

The Friedman test, comparing results over all the synthetic datasets, ranks NE-DT best with respect to the AUC score (Fig. 5). However, for the \(F_1\) and accuracy scores, the results reported by NE-DT rank in the middle of the groups. The tests on the real-world benchmarks show no significant difference among methods for AUC and ACC; the \(F_1\) score test places NE-DT results in two significantly different groups, one of them ranking best. Regarding the NE-DT parameters that are common to decision trees, we find that they influence results in the expected manner, i.e., trees with a higher maximum depth tend to overfit, and there are no notable differences between the two splitting criteria, gini and entropy.

Thus, as a conclusion regarding the analyzed data sets and performance indicators, we can assert that the AUC values reported by NE-DT are in most cases better than, or at least as good as, those of the other tree-based methods with which it was compared. The accuracy values also ranked among the best, while the \(F_1\) score placed NE-DT in the middle. These results indicate that NE-DT would be useful in settings in which the value of the AUC is of importance, i.e., where the probabilities assigned to positive instances have practical meaning and are used further.

Regarding the computational time required to run the experiments, there is no significant difference between NE-DT and the other base DT methods that use axis-parallel hyperplanes for splitting node data.

5 Conclusions and further research

The main goal of this study was to explore the use of the Nash equilibrium concept as a solution to the classification problem. Choosing the split parameters for a node within a decision tree represents a decision-making problem that requires an output suitable for predictions. In most situations there are infinitely many possible solutions that lead to the same split, i.e., the same data split in the sub-nodes and subsequently the same values for indicators such as gini or entropy. We propose choosing a point that is a Nash equilibrium, which may provide additional properties while representing a valid solution for the classification problem. The split is chosen in such a manner that no unilateral deviation to either side of the hyperplane would provide a better separation of the node data.

Thus, NE-DT is a decision tree that splits data based on the equilibrium of a two-player game. Designed for the binary classification problem, the equilibrium spreads the instances with different labels, making them easier to separate. The equilibria-based DT variant is tested on a set of synthetic and real-world data, and the results are compared with those of other decision trees that use the same quality indicators to choose splitting attributes for the nodes.

One of the advantages of the game-based proposal is that the equilibrium of the node game is computed analytically, making it computationally competitive with other decision trees. The node-splitting game takes into account, within the payoff functions, the location of the data and not only label proportions, making the method more data-oriented.

There are several research directions that may stem from this approach. A next step that could improve the classification is to design a game for selecting attributes at node level using an equilibrium concept. It is known that the greedy choice of an attribute at lower levels of the tree may not lead to optimal results. In this direction, several paths may be explored. We envisage at least two concrete ones: (i) design a game at node level, assigning a payoff function to each attribute based on the equilibrium of the node game for splitting data; (ii) design an extensive form game to tackle attribute choices over the entire tree. Another possible approach to the attribute selection problem is to explore designing game-based oblique decision trees, in which the hyperplane parameters represent game equilibria. The efficiency of the computation also makes it possible to explore the use of boosting and bagging techniques based on the equilibrium data split. Moreover, the use of the Nash equilibrium concept need not be limited to decision trees and could be extended to other models. Considering other solution concepts, e.g., the strong Nash equilibrium or the generalized Nash equilibrium, is also a path worth exploring in future applications.