
Exploring the Impact of Purity Gap Gain on the Efficiency and Effectiveness of Random Forest Feature Selection

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11683)

Abstract

The Random Forest (RF) classifier can support both wrapper and embedded feature selection through the Mean Decrease Accuracy (MDA) and Mean Decrease Impurity (MDI) methods, respectively. MDI is known to be biased towards predictor variables with many distinct values, whilst MDA is stable in this regard. MDA is therefore the generally preferred option for RF-based feature selection, despite its higher computational overhead compared with MDI. This research seeks to simultaneously reduce the computational overhead and improve the effectiveness of RF feature selection. We propose two improvements to the MDI method to overcome its shortcomings. The first is our proposed Purity Gap Gain (PGG) measure, which emphasises computational efficiency, as an alternative to the Gini Importance (GI) metric. The second is a Relative Mean Decrease Impurity (RMDI) score, which aims to offset the bias towards multi-valued predictor variables through random feature value permutations. Experiments are conducted on UCI datasets to establish the impact of PGG and RMDI on RF performance.
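
For context on the two baseline importance measures contrasted in the abstract, the sketch below shows how MDI and MDA are typically obtained from a trained RF, here via scikit-learn's RandomForestClassifier.feature_importances_ (MDI) and sklearn.inspection.permutation_importance (an MDA-style estimate). It only illustrates the efficiency trade-off that motivates the paper; it is not an implementation of the proposed PGG or RMDI measures, and the dataset loader and hyperparameters are illustrative assumptions.

```python
# Minimal sketch (not the paper's PGG/RMDI method): contrasting the two standard
# RF importance measures, using scikit-learn (permutation_importance needs >= 0.22).
from sklearn.datasets import load_wine          # stand-in for a UCI-style dataset
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

# MDI (Gini importance): accumulated impurity decrease recorded during training,
# so it adds essentially no overhead but is biased toward high-cardinality features.
mdi = rf.feature_importances_

# MDA (permutation importance): accuracy drop when each feature is permuted on
# held-out data; more stable w.r.t. cardinality but needs extra passes over the data.
mda = permutation_importance(rf, X_te, y_te, n_repeats=10,
                             random_state=0).importances_mean
```

In this setting MDI comes essentially for free with training, whereas the permutation-based estimate repeats scoring n_repeats times per feature; that extra cost is the overhead the paper seeks to reduce.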


Notes

  1.

    Note that although \(X_k\) is drawn from X, the latter is a set while the former is a multiset (a brief illustration follows these notes).

  2.

    Information about node depth is not considered in the current implementation.

  3.

    2 cores running at 2.70 GHz and 2 GB of RAM.

  4.

    This value is larger than N, the size of a training set X drawn from a data set.

  5.

    Sorting is done in ascending order, hence the list is arranged from weakest to strongest.

  6.

    This dataset is easier to analyze since it has fewer features.
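
The following sketch illustrates the set/multiset distinction from note 1: bagging draws each tree's training sample \(X_k\) from X with replacement, so \(X_k\) may contain repeated instances even though X holds N distinct ones. It is a minimal illustration assuming NumPy; the variable names are hypothetical.

```python
# Note 1 illustration: a bootstrap sample X_k is drawn from X with replacement,
# so it is a multiset (duplicates allowed) even though X has N distinct instances.
import numpy as np

rng = np.random.default_rng(0)
N = 10
X_indices = np.arange(N)                            # the N distinct instances in X
X_k = rng.choice(X_indices, size=N, replace=True)   # bootstrap sample for one tree

print(sorted(X_k))                    # some indices repeat, others never appear
print(len(set(X_k)), "unique of", N)  # on average ~63% of X appears in X_k
```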



Author information


Corresponding author

Correspondence to Mandlenkosi Victor Gwetu.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Gwetu, M.V., Tapamo, J.R., Viriri, S. (2019). Exploring the Impact of Purity Gap Gain on the Efficiency and Effectiveness of Random Forest Feature Selection. In: Nguyen, N., Chbeir, R., Exposito, E., Aniorté, P., Trawiński, B. (eds) Computational Collective Intelligence. ICCCI 2019. Lecture Notes in Computer Science (LNAI), vol 11683. Springer, Cham. https://doi.org/10.1007/978-3-030-28377-3_28

  • DOI: https://doi.org/10.1007/978-3-030-28377-3_28

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-28376-6

  • Online ISBN: 978-3-030-28377-3

  • eBook Packages: Computer Science, Computer Science (R0)
