A New Feature Sampling Method in Random Forests for Predicting High-Dimensional Data

  • Conference paper
Advances in Knowledge Discovery and Data Mining (PAKDD 2015)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9078)

Abstract

Random forest (RF) models have been shown to perform well in both classification and regression. However, because randomization is applied both when bagging samples and when selecting features, the performance of RF can deteriorate on high-dimensional data. In this paper, we propose a new feature sampling approach that lets RF deal with high-dimensional data. We first use \(p\)-values to assess feature importance and to find a cut-off between informative and less informative features. The set of informative features is then further partitioned into two groups, highly informative and informative features, using statistical measures. When sampling the feature subspace for learning RFs, features from all three groups (highly informative, informative, and less informative) are taken into account. The new subspace sampling method maintains the diversity and randomness of the forest and makes it possible to generate trees with a lower prediction error. In addition, quantile regression is employed to obtain predictions in the regression problem, for robustness to outliers. The experimental results demonstrate that the proposed approach for learning random forests significantly reduces prediction errors and outperforms most existing random forests on high-dimensional data.
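
The three-group subspace sampling described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state how the \(p\)-values are computed, which statistical measures split the informative features, or how many features are drawn from each group. The function names, the threshold `alpha = 0.05`, the median-importance split, and the rule of drawing at least one feature per non-empty group are all assumptions made for illustration.

```python
import numpy as np


def partition_features(p_values, importance, alpha=0.05):
    """Split feature indices into three groups (hypothetical rule).

    Assumption: features with p < alpha are informative; the informative
    ones are split at their median importance score.
    """
    p_values = np.asarray(p_values, dtype=float)
    importance = np.asarray(importance, dtype=float)
    informative = np.flatnonzero(p_values < alpha)
    weak = np.flatnonzero(p_values >= alpha)
    if informative.size == 0:
        return informative, informative, weak
    cut = np.median(importance[informative])
    high = informative[importance[informative] > cut]
    mid = informative[importance[informative] <= cut]
    return high, mid, weak


def sample_subspace(groups, mtry, rng):
    """Draw mtry distinct features, at least one from every non-empty group."""
    groups = [g for g in groups if g.size > 0]
    total = sum(g.size for g in groups)
    if not len(groups) <= mtry <= total:
        raise ValueError("mtry must lie between the group count and feature count")
    # One feature per group keeps every group represented in the subspace.
    picked = np.concatenate([rng.choice(g, size=1) for g in groups])
    # Fill the remaining slots uniformly from the features not yet picked.
    remaining = mtry - picked.size
    if remaining > 0:
        pool = np.setdiff1d(np.concatenate(groups), picked)
        extra = rng.choice(pool, size=remaining, replace=False)
        picked = np.concatenate([picked, extra])
    return np.sort(picked)


# Toy example with six features; the values are made up for illustration.
p_vals = [0.001, 0.2, 0.01, 0.5, 0.03, 0.04]
scores = [0.9, 0.1, 0.5, 0.05, 0.3, 0.7]
high, mid, weak = partition_features(p_vals, scores)
subspace = sample_subspace([high, mid, weak], mtry=4, rng=np.random.default_rng(0))
```

In a forest, a subspace such as `subspace` would be redrawn for each tree or node, so that informative features are always present while the less informative ones still contribute randomness.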

Corresponding author

Correspondence to Mark Junjie Li.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Nguyen, TT., Zhao, H., Huang, J.Z., Nguyen, T.T., Li, M.J. (2015). A New Feature Sampling Method in Random Forests for Predicting High-Dimensional Data. In: Cao, T., Lim, EP., Zhou, ZH., Ho, TB., Cheung, D., Motoda, H. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2015. Lecture Notes in Computer Science, vol 9078. Springer, Cham. https://doi.org/10.1007/978-3-319-18032-8_36

  • DOI: https://doi.org/10.1007/978-3-319-18032-8_36

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-18031-1

  • Online ISBN: 978-3-319-18032-8

  • eBook Packages: Computer Science, Computer Science (R0)
