Stratified Over-Sampling Bagging Method for Random Forests on Imbalanced Data

  • Conference paper
Intelligence and Security Informatics (PAISI 2016)

Part of the book series: Lecture Notes in Computer Science (LNSC, volume 9650)

Abstract

Imbalanced data poses a major challenge for random forests (RF). Over-sampling is a commonly used remedy: it increases the number of minority-class instances to balance the class distribution. However, when only minority-class instances are re-sampled, the resulting training data sets are often highly correlated, which reduces the generalizability of RF. To solve this problem, we propose a stratified over-sampling bagging (SOB) method that generates training data sets for RF that are both balanced and diverse. We first cluster the training data set multiple times to produce multiple clustering results. The small individual clusters are then grouped according to their entropies. Next, we draw a set of training data sets from the groups of clusters using stratified sampling. Finally, these training data sets are used to train RF. The data sets sampled with SOB are guaranteed to be balanced and diverse, which improves the performance of RF on imbalanced data. In a series of experiments, the proposed method proved more effective than some existing sampling methods.
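The abstract gives only the outline of the procedure, so the following is a minimal sketch of the sampling step under stated assumptions, not the authors' implementation. It assumes that the clusterings are already available as lists of cluster ids (from any clustering algorithm), that clusters are pooled into equal-width entropy bins, that every majority-class instance is kept, and that minority-class instances are over-sampled with replacement in equal shares across the entropy groups. The function names (`cluster_entropy`, `group_by_entropy`, `balanced_bag`, `sob_bags`) and the parameter `n_groups` are hypothetical.

```python
import math
import random
from collections import Counter, defaultdict


def cluster_entropy(labels):
    """Shannon entropy (in bits) of the class labels inside one cluster."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def group_by_entropy(cluster_ids, y, n_groups=3):
    """Pool clusters into equal-width entropy bins (an assumed grouping rule)."""
    members = defaultdict(list)
    for i, cid in enumerate(cluster_ids):
        members[cid].append(i)
    entropy = {cid: cluster_entropy([y[i] for i in idx]) for cid, idx in members.items()}
    max_h = max(entropy.values()) or 1.0  # fall back to 1.0 when every cluster is pure
    groups = defaultdict(list)
    for cid, idx in members.items():
        bin_no = min(int(entropy[cid] / max_h * n_groups), n_groups - 1)
        groups[bin_no].extend(idx)
    return dict(groups)


def balanced_bag(groups, y, minority, rng):
    """Keep every majority row; over-sample minority rows evenly across groups."""
    majority = [i for idx in groups.values() for i in idx if y[i] != minority]
    strata = [[i for i in idx if y[i] == minority] for idx in groups.values()]
    strata = [s for s in strata if s]  # only groups that contain minority rows
    if not strata:
        raise ValueError("no minority instances to over-sample")
    base, extra = divmod(len(majority), len(strata))
    bag = list(majority)
    for k, stratum in enumerate(strata):
        quota = base + (1 if k < extra else 0)  # quotas sum to len(majority)
        bag.extend(rng.choices(stratum, k=quota))  # sampling with replacement
    return bag


def sob_bags(clusterings, y, minority, seed=0):
    """Build one balanced bag of row indices per clustering result."""
    rng = random.Random(seed)
    return [balanced_bag(group_by_entropy(c, y), y, minority, rng) for c in clusterings]


# Toy data: 13 majority rows (label 0) and 3 minority rows (label 1).
y = [0] * 4 + [0] * 7 + [1] + [0, 0, 1, 1]
# Two hypothetical clustering results over the same 16 rows.
clusterings = [
    [0] * 4 + [1] * 8 + [2] * 4,
    [0] * 8 + [1] * 8,
]
bags = sob_bags(clusterings, y, minority=1)
```

Each element of `bags` is a list of row indices that could be handed to a tree learner in place of an ordinary bootstrap sample. Because the entropy groups depend on the clustering, different clustering results can allocate the minority draws differently, which is one plausible reading of how the method obtains diversity across bags.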

Notes

  1. http://archive.ics.uci.edu/ml/index.html

Acknowledgments

This work was supported by Guangdong Fund under Grant No. 2013B091300019, NSFC under Grant No. 61305059 and No. 61473194, and Natural Science Foundation of SZU (Grant No. 201432).

Corresponding author

Correspondence to He Zhao.

Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Zhao, H., Chen, X., Nguyen, T., Huang, J.Z., Williams, G., Chen, H. (2016). Stratified Over-Sampling Bagging Method for Random Forests on Imbalanced Data. In: Chau, M., Wang, G., Chen, H. (eds) Intelligence and Security Informatics. PAISI 2016. Lecture Notes in Computer Science, vol 9650. Springer, Cham. https://doi.org/10.1007/978-3-319-31863-9_5

  • DOI: https://doi.org/10.1007/978-3-319-31863-9_5

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-31862-2

  • Online ISBN: 978-3-319-31863-9

  • eBook Packages: Computer Science, Computer Science (R0)
