1 Introduction

Random forest (RF) is an ensemble-based, supervised machine learning algorithm [5]. It combines numerous randomized decision trees, each serving as an atomic unit, for classification and regression problems. Since the trees can be trained and executed in parallel, RF is fast and easy to implement. It has been applied in various domains, such as medical imaging, pattern recognition, and classification [7].

A decision tree in RF is built during the training phase using bootstrap sampling. The performance of a decision tree depends on several important parameters, such as the splitting criterion, feature selection, number of trees, and number of instances at a leaf node. However, the best choice of these parameters is not precisely known [10, 14]. This has motivated various heuristic approaches to building the decision trees and hence the RF. For example, to reduce computation and improve accuracy, Geurts et al. [9] introduced randomness in both attribute selection and splitting-point choice, which reduces variance compared to weaker randomization approaches. Paul et al. proposed a method to discard unimportant features and limit the number of trees [15]. In addition, several researchers have worked on proving the consistency of RF and leveraging the dependency on the data [3, 4, 8, 17]. Denil et al. [8] used a Poisson distribution for feature selection while growing a tree, whereas Wang et al. [17] proposed a Bernoulli Random Forest (BRF) framework that incorporates the Bernoulli distribution for feature and splitting-point selection. In recent years, the success of deep neural networks has inspired other learners to benefit from deep, layered architectures; Zhou et al. [20] therefore proposed the Deep RF, whose performance is robust to hyper-parameter settings.

Murthy et al. [13] proposed the oblique decision tree, which splits the feature space using a hyperplane defined by a linear combination of the feature variables. There may exist many domains where one or two oblique hyperplanes give the best classification; in such situations, an axis-parallel split has to approximate the correct model with a staircase structure. However, finding the optimal oblique split has exponential computational cost and is NP-hard [13]. Wickramarachchi et al. [18] proposed a new way to induce oblique decision trees using the eigenvectors of the estimated covariance matrices of the respective classes. In both methods, parameter tuning is time-consuming, so finding the best-fit hyperplane takes longer, although these methods do capture linear relationships between the features. We propose an M-ary random forest (MaRF) approach, which uses N independent features at a time to partition the feature space into \(2^{N}\) regions. The proposed approach is tractable and takes less time than the oblique decision tree.

The remainder of the paper is organized as follows: Sect. 2 presents the proposed MaRF approach; Sect. 3 discusses the implementation details and the performance analysis over the UCI and HSI datasets; Sect. 4 concludes the paper.

Fig. 1. An example of a binary decision tree.

Fig. 2. An example of a 4-ary decision tree.

2 Proposed Approach

In conventional RF [5], each decision tree is designed as a binary tree using axis-parallel splits. It partitions the feature space into two subspaces using a feature \(X_{i}\) and a threshold value \(\tau _{1}\); the feature is selected on the basis of the optimum value of the splitting criterion. If \(X_{i} < \tau _{1}\), the instance goes to subtree "a", otherwise to subtree "b", as shown in Fig. 1. At any internal node, there are at most two subtrees. However, the binary decision tree is unable to capture feature dependency. Therefore, the M-ary RF is proposed. It uses N independent features at a time to divide the feature space into a maximum of \(2^{N}\) subspaces, and it computes the splitting-criterion value for all possible combinations of N features to decide which features to split on. For example, with \(N = 2\) the feature space is divided into \(2^2 = 4\) subspaces, refer Fig. 2. Let \(X_{i}\) and \(X_{j}\) be the two features selected for splitting at an internal node of an M-ary decision tree, with threshold values \(\tau _{1}\) and \(\tau _{2}\). If both \(X_{i} < \tau _{1}\) and \(X_{j} < \tau _{2}\) are true (T), the data goes to subtree "a"; if the first is true and the second false (F), to subtree "b"; if the first is F and the second T, to subtree "c"; and if both are F, to subtree "d". Refer to Algorithm 1 for the design of the M-ary decision tree; an illustrative sketch of the routing rule is given after it. In MaRF, all decision trees are constructed as M-ary decision trees, and prediction is then done on the basis of majority voting.

Algorithm 1. Designing the M-ary decision tree.
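As a minimal sketch of the routing rule described above (not the authors' reference implementation), the following Python fragment shows how a node with \(N = 2\) features dispatches an instance to one of the \(2^2 = 4\) subtrees. The function name, feature indices, and threshold values are illustrative assumptions.

```python
import numpy as np

def route_4ary(x, i, j, tau1, tau2):
    """Route instance x to one of four subtrees using features i, j
    and thresholds tau1, tau2 (N = 2, hence 2**2 = 4 regions)."""
    cond_i = x[i] < tau1   # test on feature X_i
    cond_j = x[j] < tau2   # test on feature X_j
    if cond_i and cond_j:            # (T, T) -> subtree "a"
        return "a"
    elif cond_i and not cond_j:      # (T, F) -> subtree "b"
        return "b"
    elif not cond_i and cond_j:      # (F, T) -> subtree "c"
        return "c"
    else:                            # (F, F) -> subtree "d"
        return "d"

# Example: x = [0.2, 0.8] with thresholds 0.5 and 0.5 satisfies the
# first condition but not the second, so it is routed to subtree "b".
print(route_4ary(np.array([0.2, 0.8]), 0, 1, 0.5, 0.5))
```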

3 Experiments and Results

The proposed method has been rigorously tested over twenty-one datasets covering both classification and regression tasks. It has also been tested on a real-life application using hyperspectral imaging (HSI) datasets.

3.1 Datasets

We have used UCI datasets [2] for the evaluation of the proposed method. These datasets vary in the number of classes, number of features, and number of instances, and are hence heterogeneous in nature; the ratio of the number of instances to the number of features also varies widely across these benchmarks. A detailed description of the datasets is given in Tables 1 and 2 for classification and regression, respectively.

Table 1. Classification accuracy (in %) comparison between state-of-the-art methods and the proposed MaRF, averaged over 10 iterations (higher is better)
Table 2. MSE comparison between state-of-the-art methods and the proposed MaRF, averaged over 10 iterations (lower is better)

3.2 Parameters

There are four main parameters associated with the decision trees, namely: (1) the number of trees \({n_{tree}}\), (2) the minimum number of instances at a leaf node \({n_{min}}\), (3) the train-test ratio in which each dataset is divided into a training set and a test set, and (4) the maximum tree depth \({T_{depth}}\). In our experiments, the value of \({n_{tree}}\) is kept at 45, which has been decided empirically. The \({n_{min}}\) is kept at 5 and the train-test ratio at 0.7, while \({T_{depth}}\) varies across datasets. Each experiment has been repeated over 10 iterations and the mean value is reported; these settings are summarized in the sketch below.
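For concreteness, a hypothetical configuration mirroring the settings reported above might look as follows; the key names are illustrative assumptions, not taken from the authors' code, while the values are those stated in this section.

```python
# Illustrative experiment configuration (key names are assumptions;
# values follow the settings reported in this section).
config = {
    "n_tree": 45,        # number of trees, chosen empirically
    "n_min": 5,          # minimum number of instances at a leaf node
    "train_ratio": 0.7,  # fraction of each dataset used for training
    "T_depth": None,     # maximum tree depth; varies per dataset
    "n_iterations": 10,  # each experiment repeated 10 times, mean reported
}
```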

Table 3. Comparison of \(T_{depth}\) between conventional RF and MaRF

3.3 Performance Analysis

The results generated with the proposed MaRF are compared with conventional RF [5] and recent state-of-the-art methods on the classification and regression datasets. The highest learning performance in these comparisons is marked in boldface for each dataset. For the classification performance analysis, the experiments have been conducted on eleven well-known UCI datasets. The proposed method shows improvement on nine of the eleven datasets in comparison to the state-of-the-art methods Biau08 [4], Biau12 [3], and Denil [8]. To investigate the effect of choosing N independent features for partitioning over binary splits, it has also been compared to conventional RF. One can observe from Table 1 that the proposed MaRF shows improvement on all datasets except Abalone compared to conventional RF. We have also computed the tree depth \(T_{depth}\) for a few datasets and compared it with conventional RF, as shown in Table 3. One can observe that with the MaRF approach, the decision trees grow to a smaller depth than in conventional RF; therefore, the testing time of MaRF would also be less.

For regression, it can be observed from Table 2 that MaRF achieves a significant reduction in MSE on all of the datasets compared to all state-of-the-art methods. In particular, for the Yacht, Student, and Concrete datasets, the proposed method reduces the MSE by more than 40%. From Table 2, one can also observe that the proposed MaRF method shows improvement on all categorical and numerical datasets.

3.4 Computation Cost Analysis

In conventional RF, consider the case of the extremely randomized tree [9], in which \(k=\sqrt{q}\) features are selected at each node, where q is the total number of features. With p instances, the time complexity to construct a tree is \(\mathcal {O}(k \cdot p)\) [12]. In the case of an oblique decision tree, the number of possible distinct hyperplanes is \({p \atopwithdelims ()k}\) and each feature value can be selected in \(2^k\) ways; the computation cost is therefore \(\mathcal {O}(2^k \cdot {p \atopwithdelims ()k})\) [13], and finding the optimal oblique split is NP-hard. In the case of MaRF, suppose \(N = 2\) features are used at each node for splitting; it then searches over \({k \atopwithdelims ()2}\) ways to choose the two best features. In general, the computation cost is \(\mathcal {O}({k \atopwithdelims ()N} \cdot p)\). The MaRF approach does not require the parameter tuning for multi-feature splitting that the oblique decision tree requires.
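To make the gap concrete, the short sketch below evaluates the three per-node cost expressions (constants dropped) for the illustrative values \(p = 1000\) instances and \(k = \sqrt{q} = 10\) candidate features; these numbers are assumptions chosen only for illustration.

```python
from math import comb

p, k, N = 1000, 10, 2  # illustrative: instances, candidate features, MaRF split size

cost_rf = k * p                    # conventional RF: O(k * p)
cost_oblique = 2**k * comb(p, k)   # oblique tree: O(2^k * C(p, k))
cost_marf = comb(k, N) * p         # MaRF: O(C(k, N) * p)

print(f"RF:      {cost_rf:.3e}")       # ~1.0e+04
print(f"Oblique: {cost_oblique:.3e}")  # ~2.7e+26, intractable
print(f"MaRF:    {cost_marf:.3e}")     # ~4.5e+04
```

Even for these modest sizes, the oblique search space is astronomically larger, while MaRF stays within a small constant factor of conventional RF.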

3.5 Real Life Application: Hyperspectral Image Classification

In this section, we evaluate the performance of MaRF on two publicly available benchmark HSI datasets, namely Indian Pines and Pavia University [1]. The Indian Pines dataset is captured with 224 spectral bands in the wavelength range from 0.4 to \(2.5\,\mu m\). It has \(145 \times 145\) pixels with a spatial resolution of 20 m/pixel. After 24 noisy bands are removed, 200 bands are selected for classification. The reference map has 10366 pixels belonging to 16 classes [1]. The Pavia University data has 115 spectral bands in the wavelength range from 0.43 to \(0.86\,\mu m\). It has \(610 \times 340\) pixels with a high spatial resolution of 1.3 m/pixel. The 12 noisy bands are removed, and the remaining 103 bands are selected for classification. The reference map has 42776 pixels belonging to 9 classes [1].

All the parameters are kept the same except the train-test ratio, since the limited availability of labeled training samples in HSI makes the classification task more challenging [11]. Therefore, only a limited set of instances is chosen for training: 15 instances from each class of Indian Pines and 50 instances per class of Pavia University, with the rest used for testing. The performance of the algorithms is measured using the overall accuracy (OA), average accuracy (AA), and kappa coefficient (\(\kappa \)) [16]. The OA is the ratio of the number of correctly classified samples to the total number of test samples; the AA is the average percentage of correctly classified samples per class; and the \(\kappa \) coefficient is used as a consistency check. A high value is preferred for all three measures [16].
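As a sketch of how these three standard measures [16] can be computed from a confusion matrix (this is not code from the paper, just the usual definitions):

```python
import numpy as np

def evaluate(conf):
    """Compute OA, AA, and the kappa coefficient from a confusion matrix
    where conf[i, j] = number of class-i samples predicted as class j."""
    total = conf.sum()
    oa = np.trace(conf) / total                   # overall accuracy
    per_class = np.diag(conf) / conf.sum(axis=1)  # class-wise accuracy
    aa = per_class.mean()                         # average accuracy
    # Chance agreement from the row/column marginals, then Cohen's kappa
    pe = (conf.sum(axis=0) * conf.sum(axis=1)).sum() / total**2
    kappa = (oa - pe) / (1 - pe)
    return oa, aa, kappa

# Toy 2-class example: OA = 0.85, AA = 0.85, kappa = 0.70
conf = np.array([[45, 5],
                 [10, 40]])
print(evaluate(conf))  # higher is better for all three measures
```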

The proposed MaRF method is compared with SVM-3DG [6] and CRF [19]. Cao et al. [6] proposed an approach that uses spatial information as a prior, both for extracting spatial features before classification and for labeling classes in post-processing. One can observe from Table 4 that for the Indian Pines image, the proposed MaRF method improves the overall accuracy of predicting the correct class labels by more than \(4\%\). It also improves consistency, with a rise in the \(\kappa \) coefficient of \(5\%\). However, the proposed approach yields poor class-wise accuracy for the small classes, owing to the class-imbalance problem. For the Pavia University image, the proposed MaRF method shows improvement on all measures (OA, AA, and the \(\kappa \) coefficient). Compared with conventional RF as well, the proposed MaRF method shows significant improvement in class-wise accuracy and in all other measured parameters. The results are encouraging in that, even though SVM-3DG [6] uses extracted features for classification and CRF [19] uses feature selection, the proposed MaRF still outperforms them while being applied directly to the pixels without extracting any features. One can also observe the visual impact on the corresponding classification maps in Fig. 3.

Table 4. Class Specific Accuracy, OA, AA and kappa coefficient (\(\kappa \)) (in %) obtained by SVM-3DG [6], CRF [19], RF [5] and MaRF for Indian Pines and Pavia University Dataset (# Iterations = 10)
Fig. 3. Classification maps of the Indian Pines (Row 1) and Pavia University (Row 2) images, showing, from left to right, the original map, the ground truth, and the MaRF result for the test samples.

4 Conclusion

In this paper, we have proposed an M-ary random forest (MaRF) based on multi-feature splitting. The proposed MaRF approach is tested over several well-known heterogeneous datasets from the UCI repository. It has shown promising results for both classification and regression tasks as compared to conventional RF and other state-of-the-art methods. In MaRF, decision trees grow to a smaller depth than in conventional RF, which reduces the testing time as well. MaRF has also been tested on hyperspectral datasets, where the results show significant improvement in terms of AA, OA, and the kappa coefficient as compared to state-of-the-art methods. Overall, the experimental results show that the proposed method provides a promising direction for further exploration.