Abstract
Learning from high-dimensional data arises in applications such as computational biology, image classification, and finance. Most classical machine learning algorithms fail to give accurate predictions in high-dimensional settings because of the enormous feature space. In this article, we present a novel ensemble of classification trees based on weighted random subspaces that adaptively adjusts the distribution of feature selection probabilities. In the proposed algorithm, base classifiers are built on random feature subspaces, and the probability that an influential feature is selected for the next subspace is updated through a weighting function that incorporates grouping information from previous classifiers. As an interpretation tool, we show that variable importance measures computed by the new method can efficiently identify influential features. We provide theoretical reasoning for the different elements of the proposed method, and we evaluate its usefulness through simulation studies and real data analysis.
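The abstract gives no pseudocode, so the following is only a minimal Python sketch of the general weighted random-subspace scheme it describes: each tree is grown on a feature subspace drawn with non-uniform probabilities, and those probabilities are tilted toward features that earlier trees found influential. The exponential weighting of accumulated impurity-based importances stands in for the paper's grouping-based weighting function, and all names and parameters (WeightedSubspaceEnsemble, eta, subspace_size) are illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch of a weighted random-subspace tree ensemble; NOT the
# authors' exact method. The weighting function below (an exponential tilt
# of accumulated impurity importances) is an illustrative assumption.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

class WeightedSubspaceEnsemble:
    def __init__(self, n_trees=100, subspace_size=20, eta=2.0, seed=0):
        self.n_trees = n_trees
        self.subspace_size = subspace_size
        self.eta = eta                      # strength of the re-weighting
        self.rng = np.random.default_rng(seed)

    def fit(self, X, y):
        n, p = X.shape
        scores = np.zeros(p)                # accumulated importance per feature
        self.trees_, self.subspaces_ = [], []
        for _ in range(self.n_trees):
            # Selection probabilities: uniform at first, then tilted toward
            # features that earlier trees found influential.
            if scores.sum() > 0:
                w = np.exp(self.eta * scores / scores.max())
            else:
                w = np.ones(p)
            idx = self.rng.choice(p, size=min(self.subspace_size, p),
                                  replace=False, p=w / w.sum())
            tree = DecisionTreeClassifier(random_state=0).fit(X[:, idx], y)
            scores[idx] += tree.feature_importances_   # update running weights
            self.trees_.append(tree)
            self.subspaces_.append(idx)
        # Ensemble-level variable importance, analogous in spirit to the
        # interpretation tool described in the abstract.
        self.importances_ = scores / scores.sum()
        return self

    def predict(self, X):
        # Majority vote over per-tree predictions; assumes integer class
        # labels coded 0..K-1 so that bincount applies.
        votes = np.stack([t.predict(X[:, s])
                          for t, s in zip(self.trees_, self.subspaces_)]).astype(int)
        return np.apply_along_axis(lambda c: np.bincount(c).argmax(), 0, votes)

# Hypothetical usage on synthetic data with two informative features:
X = np.random.default_rng(1).normal(size=(120, 500))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
model = WeightedSubspaceEnsemble().fit(X, y)
print(model.importances_.argsort()[-5:])   # indices of the top-5 ranked features
```

In this sketch the importance ranking, not just the prediction, is a first-class output, mirroring the abstract's claim that the method doubles as an interpretation tool in settings where p greatly exceeds n.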
Cite this article
Pour, N.G., Shemehsavar, S. Learning from high dimensional data based on weighted feature importance in decision tree ensembles. Comput Stat 39, 313–342 (2024). https://doi.org/10.1007/s00180-023-01347-3