Clustering Based Bagging Algorithm on Imbalanced Data Sets

  • Conference paper
Integrated Uncertainty in Knowledge Modelling and Decision Making (IUKM 2011)

Abstract

Under-sampling the majority class is an effective way to handle classification of imbalanced data sets, but it has the drawback of discarding potentially useful information. To eliminate this drawback, we propose a Clustering Based Bagging Algorithm (CBBA). In CBBA, the majority class is clustered into several groups, and instances are randomly sampled from each group. The sampled instances are combined with the minority-class instances and used to train a base classifier, and the final prediction is produced by combining these base classifiers. Experimental results show that our approach outperforms the under-sampling method.
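The abstract does not specify the clustering method, the base learner, the per-cluster sample size, or the combination rule. The sketch below is therefore only a minimal Python illustration of the general scheme, assuming k-means clustering, decision-tree base classifiers, equal sampling from each cluster, and unweighted majority voting; the names cbba_fit and cbba_predict and all parameter defaults are illustrative, not taken from the paper.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier


def cbba_fit(X, y, n_clusters=5, n_estimators=10, random_state=0):
    """Cluster the majority class, sample from each cluster, merge with the
    minority class, and fit one base classifier per bagging round."""
    rng = np.random.default_rng(random_state)
    classes, counts = np.unique(y, return_counts=True)
    maj, mino = classes[np.argmax(counts)], classes[np.argmin(counts)]
    X_maj, X_min = X[y == maj], X[y == mino]

    # Cluster the majority class once; random sampling is repeated per round.
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=random_state).fit_predict(X_maj)

    per_cluster = max(1, len(X_min) // n_clusters)  # assumed sample size
    ensemble = []
    for i in range(n_estimators):
        parts = [X_min]
        for c in range(n_clusters):
            members = X_maj[labels == c]
            take = min(per_cluster, len(members))
            idx = rng.choice(len(members), size=take, replace=False)
            parts.append(members[idx])
        X_train = np.vstack(parts)
        y_train = np.concatenate([np.full(len(X_min), mino),
                                  np.full(len(X_train) - len(X_min), maj)])
        ensemble.append(
            DecisionTreeClassifier(random_state=i).fit(X_train, y_train))
    return ensemble


def cbba_predict(ensemble, X):
    """Combine the base classifiers by unweighted majority vote."""
    votes = np.stack([clf.predict(X) for clf in ensemble])
    # Assumes non-negative integer class labels (e.g. 0/1).
    return np.apply_along_axis(
        lambda col: np.bincount(col.astype(int)).argmax(), 0, votes)
```

With an imbalanced two-class data set X, y (integer labels), calling ensemble = cbba_fit(X, y) and then cbba_predict(ensemble, X_test) reproduces the cluster-sample-train-vote workflow described above under these assumptions.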



Copyright information

© 2011 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Sun, XY., Zhang, HX., Wang, ZC. (2011). Clustering Based Bagging Algorithm on Imbalanced Data Sets. In: Tang, Y., Huynh, VN., Lawry, J. (eds) Integrated Uncertainty in Knowledge Modelling and Decision Making. IUKM 2011. Lecture Notes in Computer Science, vol 7027. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-24918-1_20

  • DOI: https://doi.org/10.1007/978-3-642-24918-1_20

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-24917-4

  • Online ISBN: 978-3-642-24918-1

  • eBook Packages: Computer Science (R0)
