
Exploring the Impact of Purity Gap Gain on the Efficiency and Effectiveness of Random Forest Feature Selection

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11683)

Abstract

The Random Forest (RF) classifier can support both wrapper and embedded feature selection through the Mean Decrease Accuracy (MDA) and Mean Decrease Impurity (MDI) methods, respectively. MDI is known to be biased towards predictor variables with many distinct values, whilst MDA is stable in this regard. MDA is therefore the generally preferred option for RF-based feature selection, despite its higher computational overhead compared with MDI. This research seeks to simultaneously reduce the computational overhead and improve the effectiveness of RF feature selection. We propose two improvements to the MDI method to overcome its shortcomings. The first is our proposed Purity Gap Gain (PGG) measure, which emphasises computational efficiency, as an alternative to the Gini Importance (GI) metric. The second is a Relative Mean Decrease Impurity (RMDI) score, which aims to offset the bias towards multi-valued predictor variables through random feature value permutations. Experiments are conducted on UCI datasets to establish the impact of PGG and RMDI on RF performance.
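
For context on the two baseline importance measures contrasted in the abstract, the sketch below shows how MDI and MDA are typically obtained from a trained RF, here via scikit-learn's RandomForestClassifier.feature_importances_ (MDI) and sklearn.inspection.permutation_importance (an MDA-style estimate). It only illustrates the efficiency trade-off that motivates the paper; it is not an implementation of the proposed PGG or RMDI measures, and the dataset loader and hyperparameters are illustrative assumptions.

```python
# Minimal sketch (not the paper's PGG/RMDI method): contrasting the two standard
# RF importance measures, using scikit-learn (permutation_importance needs >= 0.22).
from sklearn.datasets import load_wine          # stand-in for a UCI-style dataset
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

# MDI (Gini importance): accumulated impurity decrease recorded during training,
# so it adds essentially no overhead but is biased toward high-cardinality features.
mdi = rf.feature_importances_

# MDA (permutation importance): accuracy drop when each feature is permuted on
# held-out data; more stable w.r.t. cardinality but needs extra passes over the data.
mda = permutation_importance(rf, X_te, y_te, n_repeats=10,
                             random_state=0).importances_mean
```

In this setting MDI comes essentially for free with training, whereas the permutation-based estimate repeats scoring n_repeats times per feature; that extra cost is the overhead the paper seeks to reduce.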


Notes

  1.

    Note that although \(X_k\) is drawn from X, the latter is a set while the former is a multiset (a brief illustration follows these notes).

  2.

    Information about node depth is not considered in the current implementation.

  3.

    2 cores running at 2.70 GHz and 2 GB of RAM.

  4.

    This value is larger than N, the size of a training set X drawn from a data set.

  5.

    Sorting is done in ascending order, hence the list is arranged from weakest to strongest.

  6.

    This dataset is easier to analyze since it has fewer features.
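
The following sketch illustrates the set/multiset distinction from note 1: bagging draws each tree's training sample \(X_k\) from X with replacement, so \(X_k\) may contain repeated instances even though X holds N distinct ones. It is a minimal illustration assuming NumPy; the variable names are hypothetical.

```python
# Note 1 illustration: a bootstrap sample X_k is drawn from X with replacement,
# so it is a multiset (duplicates allowed) even though X has N distinct instances.
import numpy as np

rng = np.random.default_rng(0)
N = 10
X_indices = np.arange(N)                            # the N distinct instances in X
X_k = rng.choice(X_indices, size=N, replace=True)   # bootstrap sample for one tree

print(sorted(X_k))                    # some indices repeat, others never appear
print(len(set(X_k)), "unique of", N)  # on average ~63% of X appears in X_k
```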



Author information


Corresponding author

Correspondence to Mandlenkosi Victor Gwetu.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Gwetu, M.V., Tapamo, J.R., Viriri, S. (2019). Exploring the Impact of Purity Gap Gain on the Efficiency and Effectiveness of Random Forest Feature Selection. In: Nguyen, N., Chbeir, R., Exposito, E., Aniorté, P., Trawiński, B. (eds) Computational Collective Intelligence. ICCCI 2019. Lecture Notes in Computer Science (LNAI), vol 11683. Springer, Cham. https://doi.org/10.1007/978-3-030-28377-3_28

  • DOI: https://doi.org/10.1007/978-3-030-28377-3_28

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-28376-6

  • Online ISBN: 978-3-030-28377-3

  • eBook Packages: Computer Science, Computer Science (R0)
