
Improving Bagging Ensembles for Class Imbalanced Data by Active Learning

  • Chapter
  • First Online:
Advances in Feature Selection for Data and Pattern Recognition

Part of the book series: Intelligent Systems Reference Library ((ISRL,volume 138))

Abstract

Extensions of under-sampling bagging ensemble classifiers for class-imbalanced data are considered. We propose a two-phase approach, called Actively Balanced Bagging, which aims to improve recognition of the minority and majority classes beyond that of previously proposed extensions of bagging. Its key idea is to further improve an under-sampling bagging classifier (learned in the first phase) by updating, in the second phase, the bootstrap samples with a limited number of examples selected according to an active learning strategy. The results of an experimental evaluation of Actively Balanced Bagging show that this approach improves the predictions of two different baseline variants of under-sampling bagging. Further experiments demonstrate the differentiated influence of four active selection strategies on the final results and the role of tuning the ensemble's main parameters.
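The two-phase procedure described above can be sketched in code. This is a minimal illustrative sketch, not the authors' implementation: the toy nearest-centroid base learner, the binary-only vote-margin proxy used for active selection, and the parameter values (`n_estimators`, `n_active`) are all assumptions made for the example.

```python
import numpy as np

class CentroidClassifier:
    """Toy base learner: assigns each point the class of its nearest class centroid."""

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.centroids_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        return self

    def predict(self, X):
        d = ((X[:, None, :] - self.centroids_[None, :, :]) ** 2).sum(axis=2)
        return self.classes_[d.argmin(axis=1)]

def balanced_bootstrap(X, y, rng):
    """Under-sampling bootstrap: resample every class down to the minority-class size."""
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    return np.concatenate(
        [rng.choice(np.flatnonzero(y == c), size=n_min, replace=True) for c in classes]
    )

def actively_balanced_bagging(X, y, n_estimators=11, n_active=10, seed=0):
    rng = np.random.default_rng(seed)
    # Phase 1: plain under-sampling bagging.
    boots = [balanced_bootstrap(X, y, rng) for _ in range(n_estimators)]
    models = [CentroidClassifier().fit(X[b], y[b]) for b in boots]
    # Phase 2: active selection -- find the training examples on which the
    # ensemble is least decided (smallest vote margin, binary labels 0/1),
    # add them to every bootstrap sample, and refit the members.
    votes = np.stack([m.predict(X) for m in models])      # shape (T, n)
    margin = np.abs(2.0 * votes.mean(axis=0) - 1.0)       # 0 = full disagreement
    hardest = np.argsort(margin)[:n_active]
    boots = [np.concatenate([b, hardest]) for b in boots]
    models = [CentroidClassifier().fit(X[b], y[b]) for b in boots]
    return models

def predict_majority(models, X):
    """Combine ensemble members by plain majority vote."""
    votes = np.stack([m.predict(X) for m in models])
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
```

The chapter's actual strategies are richer than this margin-only proxy (four active selection strategies and weight-update variants are compared there); the sketch only shows the overall two-phase structure: balanced bootstraps first, then a small actively selected set of hard examples injected into every member's sample.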


Notes

  1. For simplicity, the margin will be denoted as m, in particular in the experiments (see Tables 3.3, 3.4, 3.5 and 3.6); the weight update methods introduced further on will be denoted analogously.

  2. http://www.ics.uci.edu/~mlearn/MLRepository.html.

  3. We are grateful to Prof. W. Michalowski and the MET Research Group from the University of Ottawa for providing us access to the scrotal-pain data set.

  4. Eibe Frank, Mark A. Hall, and Ian H. Witten: The WEKA Workbench. Online Appendix for "Data Mining: Practical Machine Learning Tools and Techniques", 4th edn. Morgan Kaufmann (2016).


Acknowledgements

The research was supported by the Poznań University of Technology Statutory Funds.


Corresponding author

Correspondence to Jerzy Błaszczyński.


Copyright information

© 2018 Springer International Publishing AG

About this chapter


Cite this chapter

Błaszczyński, J., Stefanowski, J. (2018). Improving Bagging Ensembles for Class Imbalanced Data by Active Learning. In: Stańczyk, U., Zielosko, B., Jain, L. (eds) Advances in Feature Selection for Data and Pattern Recognition. Intelligent Systems Reference Library, vol 138. Springer, Cham. https://doi.org/10.1007/978-3-319-67588-6_3


  • DOI: https://doi.org/10.1007/978-3-319-67588-6_3

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-67587-9

  • Online ISBN: 978-3-319-67588-6

  • eBook Packages: Engineering, Engineering (R0)
