Abstract
Random forest (RF) is a supervised ensemble method built from decision trees. Each decision tree recursively partitions the feature space into two disjoint sub-regions using axis-parallel splits until each sub-region becomes homogeneous with respect to a particular class or a stopping criterion is reached. The conventional RF uses one feature at a time for splitting and therefore does not consider feature inter-dependency. With this aim in mind, the current paper introduces an approach that splits on multiple features at a time, partitioning the feature space into M regions using axis-parallel splits. The forest created in this way is therefore named the M-ary Random Forest (MaRF). The suitability of the proposed method is tested over various heterogeneous UCI datasets. Experimental results show that the proposed MaRF performs better for both classification and regression. The proposed MaRF method has also been tested on hyperspectral imaging (HSI) classification, where it shows satisfactory improvement over other state-of-the-art methods.
1 Introduction
Random forest (RF) is an ensemble-based, supervised machine learning algorithm [5] (Footnote 1). It consists of numerous randomized decision trees as its atomic units and is used for both classification and regression problems. RF can be implemented and executed as parallel threads, hence it is fast and easy to implement. It has been used in various domains, such as medical imaging, pattern recognition, and classification [7].
A decision tree in RF is built during the training phase using bootstrap sampling. The performance of the decision tree depends on several important parameters, such as the splitting criterion, feature selection, the number of trees, and the number of instances at a leaf node. However, the best choice of these parameters has not been answered precisely [10, 14]. This has motivated various heuristic approaches to building the decision tree and hence the RF. For example, to reduce computation and improve accuracy, Geurts et al. [9] introduced randomness in both attribute selection and splitting-point choice, which reduces variance compared to weaker randomization approaches. Paul et al. proposed a method to remove unimportant features and to limit the number of trees [15]. In addition, several researchers have worked on proving the consistency of RF and on leveraging its dependency on the data [3, 4, 8, 17]. Denil et al. [8] used a Poisson distribution for feature selection while growing a tree, whereas Wang et al. [17] proposed a Bernoulli Random Forest (BRF) framework incorporating a Bernoulli distribution into feature and splitting-point selection. In recent years, the success of deep neural networks has inspired other learners to benefit from deep, layered architectures; Zhou et al. [20] therefore proposed the Deep Forest, whose performance is robust to hyper-parameter settings.
Murthy et al. [13] proposed an oblique decision tree. It splits the feature space using a hyperplane defined by a linear combination of the feature variables. There exist many domains where one or two oblique hyperplanes give the best classification; in such situations, an axis-parallel split has to approximate the correct model with a staircase structure. However, the computational cost of inducing an oblique decision tree is exponential, which makes it an NP-hard problem [13]. Wickramarachchi et al. [18] proposed a new way to induce an oblique decision tree using the eigenvectors of the estimated covariance matrices of the respective classes. In both methods, parameter tuning is time consuming, so finding the best-fit hyperplane takes longer, although it does capture linear relationships between the features. We propose an M-ary random forest (MaRF) approach. It uses ‘N’ independent features at a time to partition the feature space into \(2^{N}\) regions. The proposed approach is tractable and takes less time than the oblique decision tree.
The remaining paper is arranged in the following manner: Sect. 2 presents the proposed MaRF approach; Sect. 3 discusses the implementation details and the performance analysis over the UCI and HSI datasets; Sect. 4 concludes the paper.
2 Proposed Approach
In conventional RF [5], each decision tree is designed as a binary tree using axis-parallel splits. It partitions the feature space into two subspaces using a feature \(X_{i}\) and a threshold value \(\tau _{1}\). The feature is selected on the basis of the optimum value of the splitting criterion. If \(X_{i} < \tau _{1}\), go to subtree “a”, otherwise go to subtree “b”, as shown in Fig. 1. At any internal node, there are at most two subtrees. However, the binary decision tree is unable to capture feature dependency. Therefore, the M-ary RF is proposed. It uses “N” independent features at a time to divide the feature space into a maximum of \(2^{N}\) subspaces. It computes the splitting-criterion value for all possible combinations of “N” features to decide which features to split on. For example, consider \(N = 2\): the feature space is divided into \(2^2 = 4\) subspaces, refer Fig. 2. Let \(X_{i}\) and \(X_{j}\) be the two features selected for splitting at an internal node of the M-ary decision tree, with threshold values \(\tau _{1}\) and \(\tau _{2}\). If both \(X_{i} < \tau _{1}\) and \(X_{j} < \tau _{2}\) are true (T), the data goes to subtree “a”; if the first is true and the second false (F), to subtree “b”; if the first is F and the second T, to subtree “c”; and if both are F, to subtree “d”. Refer to Algorithm 1 for the construction of the M-ary decision tree. In MaRF, the decision trees are constructed as M-ary decision trees, and the prediction is then made on the basis of majority voting.
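The four-way routing rule above can be sketched in code (a minimal illustration for \(N = 2\); the function name `mary_split` and the tuple-based data layout are our assumptions, not from the paper):

```python
def mary_split(points, i, j, tau1, tau2):
    """Partition instances into 2^2 = 4 groups using features i and j.

    "a": p[i] <  tau1 and p[j] <  tau2   (T, T)
    "b": p[i] <  tau1 and p[j] >= tau2   (T, F)
    "c": p[i] >= tau1 and p[j] <  tau2   (F, T)
    "d": p[i] >= tau1 and p[j] >= tau2   (F, F)
    """
    groups = {"a": [], "b": [], "c": [], "d": []}
    for p in points:
        first = p[i] < tau1
        second = p[j] < tau2
        if first and second:
            groups["a"].append(p)
        elif first:
            groups["b"].append(p)
        elif second:
            groups["c"].append(p)
        else:
            groups["d"].append(p)
    return groups

# Example: four points, one falling in each of the four sub-regions.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
parts = mary_split(points, i=0, j=1, tau1=0.5, tau2=0.5)
```

Generalizing to N features routes each instance by the N-bit pattern of its threshold comparisons, giving up to \(2^N\) children per internal node.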
3 Experiments and Results
The proposed method has been rigorously tested over twenty-one datasets for both classification and regression tasks. It has also been tested on a real-life application using hyperspectral imaging (HSI) data.
3.1 Datasets
We have used UCI datasets [2] for the evaluation of the proposed method. These datasets vary in the number of classes, the number of features, and the number of instances, and are hence heterogeneous in nature. The ratio of the number of instances to the number of features in these benchmark datasets varies widely. Detailed descriptions of the datasets are given in Tables 1 and 2 for classification and regression, respectively.
3.2 Parameters
There are four main parameters associated with a decision tree, namely: (1) the number of trees \({n_{tree}}\); (2) the minimum number of instances at a leaf node \({n_{min}}\); (3) the train-test ratio, in which the dataset is divided into a training set and a test set; and (4) the maximum tree depth \({T_{depth}}\). In our experiments, \({n_{tree}}\) is kept at 45, a value decided empirically. \({n_{min}}\) is kept at 5 and the train-test ratio at 0.7. \({T_{depth}}\) varies for each dataset. Each experiment has been repeated over 10 iterations and the mean value is reported.
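As a sketch, the conventional-RF baseline with these settings could be configured as follows. The scikit-learn mapping is our assumption (the paper does not name its implementation), and the Iris data is only a stand-in for the UCI datasets actually used:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Mirror the stated parameters: n_tree = 45, n_min = 5, train-test ratio = 0.7.
# (max_depth is left unbounded here since T_depth varies per dataset.)
rf = RandomForestClassifier(n_estimators=45, min_samples_leaf=5, random_state=0)

X, y = load_iris(return_X_y=True)
# 70% of the instances go to training, the rest to testing.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.7, random_state=0)
rf.fit(X_tr, y_tr)
acc = rf.score(X_te, y_te)  # mean accuracy on the held-out 30%
```

In practice one would repeat this over 10 random splits and average, as the paper describes.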
3.3 Performance Analysis
The results generated with the proposed MaRF are compared with the conventional RF [5] and with recent state-of-the-art methods on the classification and regression datasets. The highest learning performance among these comparisons is marked in boldface for each dataset. For the classification performance analysis, the experiments were conducted on eleven well-known UCI datasets. The proposed method shows improvement on nine of the eleven datasets in comparison to the state-of-the-art methods Biau08 [4], Biau12 [3], and Denil [8]. To investigate the effect of choosing N independent features for partitioning over binary splits, it has also been compared to the conventional RF. One can observe from Table 1 that the proposed MaRF shows improvement over the conventional RF on all datasets except Abalone. We have also computed the tree depth \(T_{depth}\) for a few datasets and compared it with the conventional RF, as shown in Table 3. One can observe that with the MaRF approach the decision trees grow to a lesser depth than with the conventional RF; therefore, the testing time of MaRF would also be less.
In regression, it can be observed from Table 2 that MaRF achieves a significant reduction in MSE on all of the datasets compared to all state-of-the-art methods. In particular, for the Yacht, Student, and Concrete datasets, the proposed method reduces the MSE by more than 40%. From Table 2, one can also observe that the proposed MaRF method shows improvement on all categorical and numerical datasets.
3.4 Computation Cost Analysis
In conventional RF, consider the case of an extremely randomized tree [9] in which \(k=\sqrt{q}\) features are selected at each node, where ‘q’ is the number of features. If there are ‘p’ instances, the time complexity to construct a tree is \(\mathcal {O}(k \cdot p)\) [12]. In the case of an oblique decision tree, the number of possible distinct hyperplanes is \({p \atopwithdelims ()k}\) and each feature value could be selected in \(2^k\) ways; therefore, the computation cost is \(\mathcal {O}(2^k \cdot {p \atopwithdelims ()k})\) [13], making it an NP-hard problem. In the case of MaRF, suppose \(N = 2\) features are used at each node for splitting; it then searches over \({k \atopwithdelims ()2}\) ways to choose the two best features. In general, the computation cost is \(\mathcal {O}({k \atopwithdelims ()N} \cdot p)\). The MaRF approach does not require any parameter tuning for multi-feature splitting, as the oblique decision tree does.
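The gap between the two candidate counts can be made concrete with a small calculation (the numbers below are illustrative, not from the paper):

```python
from math import comb, isqrt

def oblique_candidates(p, k):
    """Distinct oblique-hyperplane candidates at a node: 2^k * C(p, k)."""
    return (2 ** k) * comb(p, k)

def marf_candidates(k, N):
    """Feature combinations examined by MaRF at a node: C(k, N)."""
    return comb(k, N)

# Example: p = 100 instances, q = 64 features, so k = sqrt(q) = 8.
p, q = 100, 64
k = isqrt(q)
print(oblique_candidates(p, k))  # 2^8 * C(100, 8): exponentially many
print(marf_candidates(k, 2))     # C(8, 2) = 28 pairs for N = 2
```

Even at this toy scale, the oblique search space is astronomically larger than the \({k \atopwithdelims ()N}\) combinations MaRF evaluates.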
3.5 Real Life Application: Hyperspectral Image Classification
In this section, we evaluate the performance of MaRF on two publicly available benchmark HSI datasets, named Indian Pines and Pavia University [1]. The Indian Pines dataset is captured with 224 spectral bands in the wavelength range from 0.4 to \(2.5\,\mu m\). It has \(145 \times 145\) pixels with a spatial resolution of 20 m/pixel. After 24 noisy bands are removed, 200 bands are selected for classification. The reference map has 10366 pixels belonging to 16 classes [1]. The Pavia University data has 115 spectral bands in the wavelength range from 0.43 to \(0.86\,\mu m\). It has \(610 \times 340\) pixels with a high spatial resolution of 1.3 m/pixel. The 12 noisy bands are removed, and the remaining 103 bands are selected for classification. The reference map has 42776 pixels belonging to 9 classes [1].
All the parameters are kept the same except the train-test ratio, since the limited availability of labeled training samples in HSI makes the classification task more challenging [11]. Therefore, only a limited set of instances is chosen for training: 15 instances per class for Indian Pines and 50 instances per class for Pavia University, with the rest used for testing. The performance of the algorithms is measured using overall accuracy (OA), average accuracy (AA), and the kappa coefficient (\(\kappa \)) [16]. The OA is the ratio of the number of correctly classified samples to the total number of test samples. The AA is the average percentage of correctly classified samples per class. The \(\kappa \) coefficient is used as a consistency check. A high value is preferred for all three measures [16].
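The three measures can be computed from a confusion matrix as follows (a minimal sketch; the helper name `metrics` and the toy 2-class matrix are ours, not from the paper):

```python
def metrics(conf):
    """OA, AA and kappa from a square confusion matrix conf[true][pred]."""
    m = len(conf)
    n = sum(sum(row) for row in conf)                  # total test samples
    diag = sum(conf[i][i] for i in range(m))           # correctly classified
    oa = diag / n
    # AA: mean per-class recall (correct / true instances of that class).
    aa = sum(conf[i][i] / sum(conf[i]) for i in range(m)) / m
    # kappa: agreement corrected for chance, via row/column marginals.
    pe = sum(sum(conf[i]) * sum(r[i] for r in conf) for i in range(m)) / n ** 2
    kappa = (oa - pe) / (1 - pe)
    return oa, aa, kappa

# Toy 2-class example (numbers are illustrative, not from the paper):
conf = [[40, 10],
        [5, 45]]
oa, aa, kappa = metrics(conf)  # oa = 0.85, aa = 0.85, kappa = 0.70
```

Note that OA and AA diverge under class imbalance, which is why both are reported for the HSI experiments.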
The proposed MaRF method is compared with SVM-3DG [6] and CRF [19]. Cao et al. [6] proposed an approach that uses spatial information as a prior, both for extracting spatial features before classification and for labeling classes in post-processing. One can observe from Table 4 that for the Indian Pines image, the proposed MaRF method improves the overall accuracy of predicting the correct class labels by more than \(4\%\). It also improves consistency, with a rise in the \(\kappa \) coefficient of \(5\%\). However, the proposed approach yields poor class-wise accuracy for small classes, due to the class imbalance problem. For the Pavia University image, the proposed MaRF method shows improvement in all the measured parameters (OA, AA, and the \(\kappa \) coefficient). Compared with the conventional RF as well, the proposed MaRF method shows significant improvement in class-wise accuracy and in all other measured parameters. The results are encouraging in the sense that even though SVM-3DG [6] uses extracted features for classification and CRF [19] uses feature selection, the proposed MaRF, which is applied directly to the pixels without extracting any features, still outperforms them. The visual impact can also be observed in the corresponding classification maps in Fig. 3.
4 Conclusion
In this paper, we have proposed an M-ary random forest (MaRF) based on multi-feature splitting. The proposed MaRF approach has been tested over several well-known heterogeneous datasets from the UCI repository. It has shown promising results for both classification and regression tasks compared to the conventional RF and other state-of-the-art methods. In MaRF, decision trees grow to a lesser depth than in the conventional RF, which reduces the testing time as well. MaRF has also been tested on hyperspectral datasets, where the results showed significant improvement in terms of AA, OA, and the kappa coefficient compared to state-of-the-art methods. Overall, the experimental results show that the proposed method provides a promising direction for further exploration.
Notes
1. Referred to as the conventional random forest throughout the text.
References
Indian pines and pavia university dataset. http://lesun.weebly.com/hyperspectral-data-set.html. Accessed 15 Jan 2019
UCI repository. https://archive.ics.uci.edu/ml/index.php. Accessed 15 Nov 2018
Biau, G.: Analysis of a random forests model. J. Mach. Learn. Res. 13(Apr), 1063–1095 (2012)
Biau, G., Devroye, L., Lugosi, G.: Consistency of random forests and other averaging classifiers. J. Mach. Learn. Res. 9(Sep), 2015–2033 (2008)
Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001)
Cao, X., Xu, L., Meng, D., Zhao, Q., Xu, Z.: Integration of 3-dimensional discrete wavelet transform and Markov random field for hyperspectral image classification. Neurocomputing 226, 90–100 (2017)
Criminisi, A., Shotton, J.: Decision Forests for Computer Vision and Medical Image Analysis. Springer (2013)
Denil, M., Matheson, D., De Freitas, N.: Narrowing the gap: random forests in theory and in practice. In: International Conference on Machine Learning, pp. 665–673 (2014)
Geurts, P., Ernst, D., Wehenkel, L.: Extremely randomized trees. Mach. Learn. 63(1), 3–42 (2006)
Ishwaran, H.: The effect of splitting on random forests. Mach. Learn. 99(1), 75–118 (2015)
Ji, R., Gao, Y., Hong, R., Liu, Q., Tao, D., Li, X.: Spectral-spatial constraint hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 52(3), 1811–1824 (2014)
Louppe, G.: Understanding random forests: from theory to practice. arXiv preprint arXiv:1407.7502 (2014)
Murthy, S.K., Kasif, S., Salzberg, S.: A system for induction of oblique decision trees. J. Artif. Intell. Res. 2, 1–32 (1994)
Oshiro, T.M., Perez, P.S., Baranauskas, J.A.: How many trees in a random forest? In: Perner, P. (ed.) MLDM 2012. LNCS (LNAI), vol. 7376, pp. 154–168. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-31537-4_13
Paul, A., Mukherjee, D.P., Das, P., Gangopadhyay, A., Chintha, A.R., Kundu, S.: Improved random forest for classification. IEEE Trans. Image Process. 27(8), 4012–4024 (2018)
Wang, L., Zhao, C.: Hyperspectral Image Processing. Springer (2016)
Wang, Y., Xia, S.T., Tang, Q., Wu, J., Zhu, X.: A novel consistent random forest framework: Bernoulli random forests. IEEE Trans. Neural Netw. Learn. Syst. 29(8), 3510–3523 (2018)
Wickramarachchi, D., Robertson, B., Reale, M., Price, C., Brown, J.: HHCART: an oblique decision tree. Comput. Stat. Data Anal. 96, 12–23 (2016)
Zhang, Y., Cao, G., Li, X., Wang, B.: Cascaded random forest for hyperspectral image classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 11(4), 1082–1094 (2018)
Zhou, Z.H., Feng, J.: Deep forest: towards an alternative to deep neural networks. arXiv preprint arXiv:1702.08835 (2017)
© 2019 Springer Nature Switzerland AG
Jain, V., Phophalia, A. (2019). M-ary Random Forest. In: Deka, B., Maji, P., Mitra, S., Bhattacharyya, D., Bora, P., Pal, S. (eds) Pattern Recognition and Machine Intelligence. PReMI 2019. Lecture Notes in Computer Science(), vol 11941. Springer, Cham. https://doi.org/10.1007/978-3-030-34869-4_18
DOI: https://doi.org/10.1007/978-3-030-34869-4_18
Print ISBN: 978-3-030-34868-7
Online ISBN: 978-3-030-34869-4