Stratified Over-Sampling Bagging Method for Random Forests on Imbalanced Data

Zhao, He; Chen, Xiaojun; Nguyen, Tung; Huang, Joshua Zhexue; Williams, Graham; Chen, Hui

doi:10.1007/978-3-319-31863-9_5

Part of the book series: Lecture Notes in Computer Science (LNSC, volume 9650)

Included in the following conference series:

Pacific-Asia Workshop on Intelligence and Security Informatics

Abstract

Imbalanced data poses a major challenge for random forests (RF). Over-sampling, a commonly used remedy, increases the number of minority-class instances to balance the class distribution. However, if we only sample more minority-class instances, the resulting training sets are often highly correlated, which reduces the generalizability of RF. To solve this problem, we propose a stratified over-sampling bagging (SOB) method that generates training data sets for RF that are both balanced and diverse. We first cluster the training data set multiple times to produce multiple clustering results. The small individual clusters are then grouped according to their entropies. Next, we draw a set of training data sets from the groups of clusters using stratified sampling. Finally, these training data sets are used to train RF. The data sets sampled with SOB are guaranteed to be balanced and diverse, which improves the performance of RF on imbalanced data. In a series of experiments, the proposed method was more effective than some existing sampling methods.
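
The sampling steps summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the clustering algorithm, how clusters are grouped by entropy, or how the sampling quota is allocated. The sketch assumes a small k-means with random initial centers, sorts clusters by class entropy and cuts them into equal-sized groups, and over-samples every class up to the majority-class size with a round-robin quota over the groups. All function names and parameters (`kmeans`, `class_entropy`, `sob_bags`, `k`, `n_groups`) are hypothetical.

```python
import math
import random
from collections import Counter, defaultdict


def kmeans(points, k, rng, iters=10):
    """Tiny Lloyd's k-means (an assumed choice); returns a cluster index per point."""
    centers = rng.sample(points, k)
    assign = [0] * len(points)
    for _ in range(iters):
        assign = [min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points]
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # keep the old center if a cluster empties out
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assign


def class_entropy(labels):
    """Shannon entropy (in bits) of the class labels inside one cluster."""
    n = len(labels)
    return -sum((m / n) * math.log2(m / n) for m in Counter(labels).values())


def sob_bags(X, y, n_bags=5, k=4, n_groups=2, seed=0):
    """Return n_bags lists of row indices, each with equal counts per class."""
    rng = random.Random(seed)
    classes = sorted(set(y))
    per_class = max(Counter(y).values())  # assumption: over-sample up to majority size
    bags = []
    for _ in range(n_bags):
        # A fresh clustering per bag (new random centers) is what varies between bags.
        assign = kmeans(X, k, rng)
        clusters = defaultdict(list)
        for i, a in enumerate(assign):
            clusters[a].append(i)
        # Assumed grouping rule: sort clusters by entropy, cut into n_groups strata.
        ordered = sorted(clusters.values(),
                         key=lambda idx: class_entropy([y[i] for i in idx]))
        size = math.ceil(len(ordered) / n_groups)
        groups = [sum(ordered[g:g + size], []) for g in range(0, len(ordered), size)]
        bag = []
        for c in classes:
            strata = [[i for i in g if y[i] == c] for g in groups]
            strata = [s for s in strata if s]  # skip groups that lack this class
            # Draw with replacement, cycling over the strata to spread the quota.
            for j in range(per_class):
                bag.append(rng.choice(strata[j % len(strata)]))
        bags.append(bag)
    return bags
```

Each returned bag is a list of row indices that could be passed to any decision-tree learner to grow one tree of the forest. Because the quota per class is fixed, every bag is balanced by construction, while the re-clustering before each bag changes which strata the instances are drawn from.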

Notes

1. http://archive.ics.uci.edu/ml/index.html

Acknowledgments

This work was supported by Guangdong Fund under Grant No. 2013B091300019, NSFC under Grant No. 61305059 and No. 61473194, and Natural Science Foundation of SZU (Grant No. 201432).

Author information

Authors and Affiliations

Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China
He Zhao & Hui Chen
College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, China
Xiaojun Chen & Joshua Zhexue Huang
Faculty of Computer Science and Engineering, Thuyloi University, Hanoi, Vietnam
Tung Nguyen
Australian National University, Canberra, Australia
Graham Williams

Corresponding author

Correspondence to He Zhao.

Editor information

Editors and Affiliations

The University of Hong Kong, Hong Kong, Hong Kong
Michael Chau
Virginia Tech, Blacksburg, Virginia, USA
G. Alan Wang
The University of Arizona, Tucson, Arizona, USA
Hsinchun Chen

About this paper

Cite this paper

Zhao, H., Chen, X., Nguyen, T., Huang, J.Z., Williams, G., Chen, H. (2016). Stratified Over-Sampling Bagging Method for Random Forests on Imbalanced Data. In: Chau, M., Wang, G., Chen, H. (eds) Intelligence and Security Informatics. PAISI 2016. Lecture Notes in Computer Science, vol 9650. Springer, Cham. https://doi.org/10.1007/978-3-319-31863-9_5

DOI: https://doi.org/10.1007/978-3-319-31863-9_5
Published: 29 March 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-31862-2
Online ISBN: 978-3-319-31863-9
eBook Packages: Computer Science, Computer Science (R0)
