Abstract
Imbalanced data pose a major challenge to random forests (RF). Over-sampling, a commonly used sampling method for imbalanced data, increases the number of minority-class instances to balance the class distribution. However, if only minority-class instances are sampled more often, the resulting sample data sets tend to be highly correlated, which reduces the generalizability of RF. To solve this problem, we propose a stratified over-sampling bagging (SOB) method that generates training data sets for RF that are both balanced and diverse. We first cluster the training data set multiple times to produce multiple clustering results, and group the small individual clusters according to their entropies. We then draw a set of training data sets from the groups of clusters using stratified sampling. Finally, these training data sets are used to train RF. The data sets sampled with SOB are guaranteed to be balanced and diverse, which improves the performance of RF on imbalanced data. Experimental results show that the proposed method is more effective than some existing sampling methods.
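The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the grouping rule or the sampling quotas, so the sketch assumes that clusters are ranked by the Shannon entropy of their class labels and split into equal-sized groups, and that each class is over-sampled with replacement, proportionally across groups, up to the majority-class size. The function names, the `n_groups` parameter, and the toy cluster assignments are all hypothetical, and the clustering step itself is taken as given (any clustering algorithm could supply the cluster ids).

```python
import math
import random
from collections import Counter, defaultdict


def cluster_entropy(labels):
    """Shannon entropy (in bits) of the class labels inside one cluster."""
    counts = Counter(labels)
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())


def group_by_entropy(cluster_ids, y, n_groups=2):
    """Assumed rule: rank clusters by entropy, split the ranking into n_groups."""
    members = defaultdict(list)
    for idx, cid in enumerate(cluster_ids):
        members[cid].append(idx)
    ranked = sorted(
        members, key=lambda cid: cluster_entropy([y[i] for i in members[cid]])
    )
    size = math.ceil(len(ranked) / n_groups)
    # Each group is the list of instance indices of its clusters.
    return [
        [i for cid in ranked[g:g + size] for i in members[cid]]
        for g in range(0, len(ranked), size)
    ]


def stratified_balanced_sample(groups, y, rng):
    """Draw one class-balanced index set, using the groups as strata."""
    class_sizes = Counter(y)
    target = max(class_sizes.values())  # every class is sampled up to this size
    sample = []
    for label in class_sizes:
        # Strata for this class: its instances inside each non-empty group.
        strata = [[i for i in g if y[i] == label] for g in groups]
        strata = [s for s in strata if s]
        total = sum(len(s) for s in strata)
        # Proportional allocation, then hand out the rounding remainder.
        quotas = [target * len(s) // total for s in strata]
        for k in range(target - sum(quotas)):
            quotas[k % len(strata)] += 1
        for s, q in zip(strata, quotas):
            sample.extend(rng.choices(s, k=q))  # with replacement: over-sampling
    return sample


# Toy data: 8 majority (0) and 2 minority (1) instances.
y = [0] * 8 + [1] * 2
# Hypothetical cluster ids from two separate clustering runs.
clusterings = [
    [0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
    [0, 0, 1, 1, 1, 1, 2, 2, 2, 2],
]
rng = random.Random(0)
bags = [
    stratified_balanced_sample(group_by_entropy(c, y), y, rng)
    for c in clusterings
]
for bag in bags:
    print(sorted(Counter(y[i] for i in bag).items()))
```

Each index set in `bags` would then serve as the training sample for one tree (or one subset of trees) in the forest. Because every clustering run yields different strata, the balanced samples differ from one another, which is the diversity the abstract aims for.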
Acknowledgments
This work was supported by Guangdong Fund under Grant No. 2013B091300019, NSFC under Grant No. 61305059 and No. 61473194, and Natural Science Foundation of SZU (Grant No. 201432).
Copyright information
© 2016 Springer International Publishing Switzerland
About this paper
Cite this paper
Zhao, H., Chen, X., Nguyen, T., Huang, J.Z., Williams, G., Chen, H. (2016). Stratified Over-Sampling Bagging Method for Random Forests on Imbalanced Data. In: Chau, M., Wang, G., Chen, H. (eds) Intelligence and Security Informatics. PAISI 2016. Lecture Notes in Computer Science, vol 9650. Springer, Cham. https://doi.org/10.1007/978-3-319-31863-9_5
DOI: https://doi.org/10.1007/978-3-319-31863-9_5
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-31862-2
Online ISBN: 978-3-319-31863-9
eBook Packages: Computer Science, Computer Science (R0)