Evolutionary under-sampling based bagging ensemble method for imbalanced data classification

Sun, Bo; Chen, Haiyan; Wang, Jiandong; Xie, Hua

doi:10.1007/s11704-016-5306-z

Research Article
Published: 23 March 2018

Volume 12, pages 331–350, (2018)
Frontiers of Computer Science

Bo Sun^1,2, Haiyan Chen^1,2, Jiandong Wang^1 & Hua Xie^2

Abstract

In class-imbalanced learning, traditional machine learning algorithms that optimize overall accuracy tend to perform poorly, especially on the minority class, in which we are most interested. Many effective approaches have been proposed to address this problem. Among them, bagging ensemble methods that integrate under-sampling techniques have demonstrated better performance than several alternatives, including bagging ensembles integrated with over-sampling techniques and cost-sensitive methods. Although these under-sampling techniques promote diversity among the generated base classifiers through random partitioning or sampling of the majority class, they take no measures to ensure the classification performance of each individual classifier, which limits the achievable ensemble performance. On the other hand, evolutionary under-sampling (EUS), a novel under-sampling technique, has been successfully applied to search for the best majority class subset for training a well-performing nearest neighbor classifier. Inspired by EUS, in this paper we introduce it into the under-sampling bagging framework and propose an EUS-based bagging ensemble method, EUS-Bag, by designing a new fitness function that considers three factors to make EUS better suited to the framework. With this fitness function, EUS-Bag can generate a set of accurate and diverse base classifiers. To verify the effectiveness of EUS-Bag, we conduct a series of comparison experiments on 22 two-class imbalanced classification problems. Experimental results measured using recall, geometric mean, and AUC all demonstrate its superior performance.
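To make the under-sampling bagging framework mentioned in the abstract concrete, here is a minimal sketch of the plain random variant, together with two of the reported measures (recall and geometric mean). It is not EUS-Bag: the majority subsets are drawn at random, whereas EUS-Bag would select them by an evolutionary search guided by its fitness function, which the abstract does not specify. The 1-NN base learner, all function names, and the default of 11 estimators are assumptions made purely for illustration.

```python
import math
import random
from collections import Counter


def undersample(majority, minority, rng):
    """Draw a random majority subset the same size as the minority class.

    Assumption: random selection stands in for the evolutionary search.
    """
    return rng.sample(majority, len(minority)) + list(minority)


def predict_1nn(train, x):
    """Return the label of the nearest training point (squared Euclidean)."""
    nearest = min(train, key=lambda p: sum((a - b) ** 2 for a, b in zip(p[0], x)))
    return nearest[1]


def fit_bag(majority, minority, n_estimators=11, seed=0):
    """Build the ensemble; each member is a balanced subset used by 1-NN."""
    rng = random.Random(seed)
    return [undersample(majority, minority, rng) for _ in range(n_estimators)]


def predict_bag(bag, x):
    """Combine the base classifiers by majority vote."""
    votes = Counter(predict_1nn(subset, x) for subset in bag)
    return votes.most_common(1)[0][0]


def recall_and_gmean(y_true, y_pred, positive=1):
    """Minority-class recall and geometric mean of the two class-wise recalls."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    n_pos = sum(t == positive for t in y_true)
    n_neg = len(y_true) - n_pos
    recall = tp / n_pos
    specificity = tn / n_neg
    return recall, math.sqrt(recall * specificity)
```

Each sample is a `(features, label)` pair, with label 1 for the minority class. Because every base classifier sees a different random majority subset, the members differ from one another, but nothing in this sketch controls how well each one performs individually, which is the gap the abstract says EUS-Bag aims to close.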

Acknowledgements

We would like to express our gratitude to both the associate editor and the anonymous reviewers for their constructive comments, which greatly improved the quality of our manuscript. This work was supported by the National Natural Science Foundation of China (Grant No. 61501229) and the Fundamental Research Funds for the Central Universities (NS2015091, NS2014067, NJ20160013).

Author information

Authors and Affiliations

College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing, 210016, China
Bo Sun, Haiyan Chen & Jiandong Wang
National Key Lab of ATFM, Nanjing University of Aeronautics and Astronautics, Nanjing, 211106, China
Bo Sun, Haiyan Chen & Hua Xie

Corresponding authors

Correspondence to Bo Sun or Haiyan Chen.

Additional information

Bo Sun is a PhD candidate in the College of Computer Science and Technology at Nanjing University of Aeronautics and Astronautics, China. He received the BS degree in computer science from Liaocheng University, China in 2009 and the MS degree in computer science from Jiangsu University, China in 2012. His research interests include ensemble learning and data mining.

Haiyan Chen is a lecturer in the College of Computer Science and Technology at Nanjing University of Aeronautics and Astronautics (NUAA), China. She received her BS and PhD degrees in computer science from NUAA in 2003 and 2012, respectively. Her research interests include machine learning, data mining, and air traffic flow management.

Jiandong Wang is a professor and doctoral adviser in the College of Computer Science and Technology at Nanjing University of Aeronautics and Astronautics, China. He graduated in electrical engineering from Shanghai Jiao Tong University, China in 1967 and was a visiting scholar at the University of Ottawa, Canada from 1990 to 1991. Professor Wang’s research interests include artificial intelligence, data mining, and information security.

Hua Xie is a lecturer in the College of Civil Aviation at Nanjing University of Aeronautics and Astronautics (NUAA), China. He received his BS and MS degrees in computer science and the PhD degree in system engineering from NUAA in 1999, 2005 and 2015, respectively. His research interests include air traffic flow management and security technology.

Electronic supplementary material

Supplementary material, approximately 350 KB.

About this article

Cite this article

Sun, B., Chen, H., Wang, J. et al. Evolutionary under-sampling based bagging ensemble method for imbalanced data classification. Front. Comput. Sci. 12, 331–350 (2018). https://doi.org/10.1007/s11704-016-5306-z

Received: 21 July 2015
Accepted: 17 June 2016
Published: 23 March 2018
Issue Date: April 2018
DOI: https://doi.org/10.1007/s11704-016-5306-z
