Abstract
The Random Forest (RF) classifier supports both wrapper and embedded feature selection through the Mean Decrease Accuracy (MDA) and Mean Decrease Impurity (MDI) methods, respectively. MDI is known to be biased towards predictor variables with many distinct values, whereas MDA is stable in this regard. MDA is therefore the generally preferred option for RF-based feature selection, despite its higher computational overhead compared to MDI. This research seeks to simultaneously reduce the computational overhead and improve the effectiveness of RF feature selection. We propose two improvements to the MDI method to overcome these shortcomings. The first is our proposed Purity Gap Gain (PGG) measure, which emphasizes computational efficiency, as an alternative to the Gini Importance (GI) metric. The second is a Relative Mean Decrease Impurity (RMDI) score, which aims to offset the bias towards multi-valued predictor variables through random feature value permutations. Experiments are conducted on UCI datasets to establish the impact of PGG and RMDI on RF performance.
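For readers unfamiliar with the two baseline measures, the following minimal sketch (ours, not the paper's code; it assumes scikit-learn is available) contrasts them: MDI is exposed there as RandomForestClassifier.feature_importances_, and an MDA-style score via sklearn.inspection.permutation_importance. It illustrates only the standard measures, not the proposed PGG or RMDI.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Embedded selection: MDI (Gini Importance) is a by-product of training,
# so it adds essentially no extra cost.
mdi = rf.feature_importances_

# Wrapper-style selection: MDA re-evaluates the fitted forest on permuted
# copies of each feature, hence the higher computational overhead noted
# in the abstract.
mda = permutation_importance(rf, X_te, y_te, n_repeats=10,
                             random_state=0).importances_mean
```

Note that the MDI score above inherits the bias towards features with many distinct split points, which is the bias the proposed RMDI score targets.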
Notes
1. Note that although \(X_k\) is drawn from X, the latter is a set while the former is a multiset (see the sketch after these notes).
2. Information about node depth is not considered in the current implementation.
3. 2 cores running at 2.70 GHz and 2 GB of RAM.
4. This value is larger than N, the size of a training set X, which is drawn from a data set.
5. Sorting is done in ascending order, hence the list is arranged from weakest to strongest.
6. This dataset is easier to analyze since it has fewer features.
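The set/multiset distinction in note 1 comes from bootstrap sampling: drawing with replacement can select the same instance more than once. A minimal illustrative sketch (ours, assuming Breiman-style bagging):

```python
import random

random.seed(0)
X = list(range(10))                    # training set X: 10 distinct instances
X_k = [random.choice(X) for _ in X]    # bootstrap sample X_k, same size as X
print(sorted(X_k))                     # repeated values show X_k is a multiset
```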