Abstract
In this paper, we propose a new random forest (RF) algorithm for high-dimensional data classification that combines a subspace feature sampling method with feature-value searching. The new subspace sampling method maintains the diversity and randomness of the forest while enabling trees with lower prediction error. A greedy technique handles high-cardinality categorical features for efficient node splitting when building the decision trees in the forest, allowing trees to cope with very high cardinality while reducing the computational cost of building the RF model. Extensive experiments on high-dimensional real data sets, including standard machine learning data sets and image data sets, show that the proposed approach significantly reduces prediction error and outperforms most existing RF methods when dealing with high-dimensional data.
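To illustrate the kind of greedy feature-value search the abstract refers to, the sketch below implements one common greedy strategy for splitting a high-cardinality categorical feature in a binary classification tree: categories are sorted by their positive-class rate and only the prefix/suffix partitions of that ordering are scanned, costing O(k log k) instead of the 2^(k-1) subsets of an exhaustive search. This is a generic, well-known technique (it dates back to CART), not necessarily the authors' exact method; the function name and interface are illustrative assumptions.

```python
import numpy as np

def greedy_categorical_split(values, labels):
    """Greedily search for a binary split of a categorical feature.

    values : 1-D array of category identifiers (may have high cardinality)
    labels : 1-D binary array (0/1) of class labels

    Categories are sorted by their positive-class rate; the best
    prefix partition of that ordering is returned together with its
    weighted Gini impurity. Generic sketch, not the paper's method.
    """
    cats = np.unique(values)
    # Positive-class rate per category drives the ordering.
    rates = [labels[values == c].mean() for c in cats]
    sorted_cats = cats[np.argsort(rates)]

    n = len(labels)
    best_gini, best_left = None, None
    left = set()
    # Scan the k-1 prefix partitions of the sorted category list.
    for i in range(len(sorted_cats) - 1):
        left.add(sorted_cats[i])
        mask = np.isin(values, list(left))
        gini = 0.0
        for side in (labels[mask], labels[~mask]):
            p = side.mean()
            gini += len(side) / n * 2 * p * (1 - p)  # weighted Gini impurity
        if best_gini is None or gini < best_gini:
            best_gini, best_left = gini, set(left)
    return best_left, best_gini
```

For example, with categories `a` and `b` all positive and `c` all negative, the search recovers the pure partition `{c}` versus `{a, b}` with impurity 0, after examining only two candidate splits.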
Acknowledgements
Part of this work was done while the author Thanh-Tung Nguyen was visiting the Department of Computer Science and Engineering, Southern University of Science and Technology (SUSTech), Shenzhen 518055, and the College of Computer Science and Software Engineering, Shenzhen University, Shenzhen 518060, China.
Cite this article
Wang, Q., Nguyen, TT., Huang, J.Z. et al. An efficient random forests algorithm for high dimensional data classification. Adv Data Anal Classif 12, 953–972 (2018). https://doi.org/10.1007/s11634-018-0318-1