Clustering Based Bagging Algorithm on Imbalanced Data Sets

  • Conference paper
Integrated Uncertainty in Knowledge Modelling and Decision Making (IUKM 2011)

Abstract

Under-sampling the majority class is an effective way to handle classification of imbalanced data sets, but it has the drawback of discarding potentially useful information. To eliminate this drawback, we propose a Clustering Based Bagging Algorithm (CBBA). In CBBA, the majority class is clustered into several groups, and instances are randomly sampled from each group. The sampled instances are combined with the minority-class instances and used to train a base classifier, and the final prediction is produced by combining these base classifiers. Experimental results show that our approach outperforms the under-sampling method.
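The abstract does not specify the clustering method, the base learner, the per-cluster sample size, or the combination rule. The sketch below is therefore only a minimal Python illustration of the general scheme, assuming k-means clustering, decision-tree base classifiers, equal sampling from each cluster, and unweighted majority voting; the names cbba_fit and cbba_predict and all parameter defaults are illustrative, not taken from the paper.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier


def cbba_fit(X, y, n_clusters=5, n_estimators=10, random_state=0):
    """Cluster the majority class, sample from each cluster, merge with the
    minority class, and fit one base classifier per bagging round."""
    rng = np.random.default_rng(random_state)
    classes, counts = np.unique(y, return_counts=True)
    maj, mino = classes[np.argmax(counts)], classes[np.argmin(counts)]
    X_maj, X_min = X[y == maj], X[y == mino]

    # Cluster the majority class once; random sampling is repeated per round.
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=random_state).fit_predict(X_maj)

    per_cluster = max(1, len(X_min) // n_clusters)  # assumed sample size
    ensemble = []
    for i in range(n_estimators):
        parts = [X_min]
        for c in range(n_clusters):
            members = X_maj[labels == c]
            take = min(per_cluster, len(members))
            idx = rng.choice(len(members), size=take, replace=False)
            parts.append(members[idx])
        X_train = np.vstack(parts)
        y_train = np.concatenate([np.full(len(X_min), mino),
                                  np.full(len(X_train) - len(X_min), maj)])
        ensemble.append(
            DecisionTreeClassifier(random_state=i).fit(X_train, y_train))
    return ensemble


def cbba_predict(ensemble, X):
    """Combine the base classifiers by unweighted majority vote."""
    votes = np.stack([clf.predict(X) for clf in ensemble])
    # Assumes non-negative integer class labels (e.g. 0/1).
    return np.apply_along_axis(
        lambda col: np.bincount(col.astype(int)).argmax(), 0, votes)
```

With an imbalanced two-class data set X, y (integer labels), calling ensemble = cbba_fit(X, y) and then cbba_predict(ensemble, X_test) reproduces the cluster-sample-train-vote workflow described above under these assumptions.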



Copyright information

© 2011 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Sun, XY., Zhang, HX., Wang, ZC. (2011). Clustering Based Bagging Algorithm on Imbalanced Data Sets. In: Tang, Y., Huynh, VN., Lawry, J. (eds) Integrated Uncertainty in Knowledge Modelling and Decision Making. IUKM 2011. Lecture Notes in Computer Science, vol 7027. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-24918-1_20

  • DOI: https://doi.org/10.1007/978-3-642-24918-1_20

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-24917-4

  • Online ISBN: 978-3-642-24918-1

  • eBook Packages: Computer Science (R0)
