
A binary PSO-based ensemble under-sampling model for rebalancing imbalanced training data

The Journal of Supercomputing

Abstract

Ensemble learning and under-sampling are both effective tools for classifying imbalanced datasets. In this paper, we propose a novel ensemble method that combines the advantages of ensemble learning for biasing classifiers with a new under-sampling method, named Binary PSO instance selection. The method works together with ensemble classifiers to find the most suitable size and combination of majority-class samples, which are then merged with the minority-class samples to form a new, rebalanced training set. The proposed method adopts a multi-objective strategy: it aims for a notable improvement in imbalanced-classification performance while preserving the integrity of the original dataset as far as possible. We evaluated the proposed method on imbalanced datasets and compared its performance with several conventional ensemble methods. An improved variant, in which the ensemble classifiers are wrapped inside the Binary PSO instance selection, was also tested on the same datasets. According to the experimental results, our proposed methods outperform single ensemble methods, state-of-the-art under-sampling methods, and combinations of these methods with the traditional PSO instance selection algorithm.
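The core idea described above, encoding each majority-class sample as one bit of a particle and letting binary PSO search for the subset that best trades classification quality against dataset integrity, can be sketched as follows. This is an illustrative toy only, not the authors' implementation: it uses 1-D synthetic data, a nearest-centroid classifier in place of an ensemble, and a weighted sum as a stand-in for the multi-objective strategy.

```python
import random
import math

random.seed(7)

# Synthetic imbalanced 1-D data (assumption: toy stand-in for a real dataset).
minority = [random.gauss(2.0, 0.5) for _ in range(15)]
majority = [random.gauss(0.0, 1.0) for _ in range(120)]

def g_mean(mask):
    """Train a nearest-centroid classifier on minority + selected majority
    samples and return the geometric mean of per-class recalls."""
    chosen = [x for x, b in zip(majority, mask) if b]
    if not chosen:
        return 0.0
    c_min = sum(minority) / len(minority)
    c_maj = sum(chosen) / len(chosen)
    tp = sum(1 for x in minority if abs(x - c_min) < abs(x - c_maj))
    tn = sum(1 for x in majority if abs(x - c_maj) <= abs(x - c_min))
    return math.sqrt((tp / len(minority)) * (tn / len(majority)))

def fitness(mask, alpha=0.9):
    # Weighted-sum stand-in for the paper's multi-objective strategy:
    # classification quality plus a reward for retaining majority samples.
    return alpha * g_mean(mask) + (1 - alpha) * sum(mask) / len(mask)

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-max(-60.0, min(60.0, v))))

n, particles, iters = len(majority), 12, 40
X = [[random.randint(0, 1) for _ in range(n)] for _ in range(particles)]
V = [[0.0] * n for _ in range(particles)]
pbest = [row[:] for row in X]
pfit = [fitness(r) for r in X]
g = pbest[max(range(particles), key=lambda i: pfit[i])][:]

for _ in range(iters):
    for i in range(particles):
        for d in range(n):
            r1, r2 = random.random(), random.random()
            V[i][d] = (0.7 * V[i][d]
                       + 1.4 * r1 * (pbest[i][d] - X[i][d])
                       + 1.4 * r2 * (g[d] - X[i][d]))
            # Binary PSO update: a bit becomes 1 with probability sigmoid(v).
            X[i][d] = 1 if random.random() < sigmoid(V[i][d]) else 0
        f = fitness(X[i])
        if f > pfit[i]:
            pfit[i], pbest[i] = f, X[i][:]
            if f > fitness(g):
                g = X[i][:]

print(round(fitness(g), 3), sum(g))  # best fitness, majority samples kept
```

In the paper's actual method, the fitness of a candidate subset is evaluated by ensemble classifiers rather than a single nearest-centroid rule, and the trade-off between performance and dataset integrity is handled as a genuine multi-objective search rather than a fixed weighted sum.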




Acknowledgements

The authors are grateful for the financial support from the research grants: (1) Nature-Inspired Computing and Meta-heuristics Algorithms for Optimizing Data Mining Performance, Grant no. MYRG2016-00069-FST, offered by the University of Macau (FST and RDAO); and (2) A Scalable Data Stream Mining Methodology: Stream-based Holistic Analytics and Reasoning in Parallel, Grant no. FDCT/126/2014/A3, offered by FDCT Macau.

Funding

This study was funded by (1) Nature-Inspired Computing and Meta-heuristics Algorithms for Optimizing Data Mining Performance, Grant no. MYRG2016-00069-FST; and (2) A Scalable Data Stream Mining Methodology: Stream-based Holistic Analytics and Reasoning in Parallel, Grant no. FDCT/126/2014/A3.

Author information


Corresponding author

Correspondence to Yaoyang Wu.

Ethics declarations

Conflict of interest

Author Jingyan Li declares that he has no conflict of interest. Author Yaoyang Wu declares that she has no conflict of interest. Author Simon Fong declares that he has no conflict of interest. Author Antonio J. Tallón-Ballesteros declares that he has no conflict of interest. Author Xin-she Yang declares that he has no conflict of interest. Author Sabah Mohammed declares that he has no conflict of interest. Author Feng Wu declares that he has no conflict of interest.

Ethical approval

This article does not contain any studies with human participants or animals performed by any of the authors.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

See Appendix Tables 6, 7, 8, 9 and 10.

Table 6 Kappa statistic results of different datasets with different methods in experiment 1

Table 7 Accuracy results of different datasets with different methods in experiment 1

Table 8 Kappa statistic results of different datasets with different methods in experiment 2

Table 9 Accuracy results of different datasets with different methods in experiment 2

Table 10 Descriptions of experimental methods for Experiment 1

About this article

Cite this article

Li, J., Wu, Y., Fong, S. et al. A binary PSO-based ensemble under-sampling model for rebalancing imbalanced training data. J Supercomput 78, 7428–7463 (2022). https://doi.org/10.1007/s11227-021-04177-6
