A bi-objective hybrid algorithm for the classification of imbalanced noisy and borderline data sets

  • Theoretical Advances
Pattern Analysis and Applications

Abstract

Classification of imbalanced data sets is one of the significant problems in machine learning and data mining. Traditional classifiers usually produce suboptimal results on imbalanced data sets. This study proposes a new bi-objective hybrid algorithm for the classification of binary imbalanced, noisy and borderline data sets. The algorithm hybridizes two metaheuristics, cuckoo search and the covariance matrix adaptation evolution strategy. The hybrid algorithm was validated in terms of its Pareto fronts. It was then used within a methodology proposed for classifying binary imbalanced data sets. The methodology estimates the probabilities for both classes (majority and minority) of a data set using the normal distribution, and the parameters of the normal distribution are optimized with the proposed algorithm. Simulated, noisy borderline and real data sets were used. Four well-known classifiers with a preprocessing algorithm were used for comparison. The performance of all classifiers was evaluated with three measures: sensitivity, G-mean and F-measure. The proposed methodology showed promising performance.
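
The three evaluation measures named in the abstract can be illustrated with a minimal sketch. It assumes the standard textbook definitions, with the minority class treated as the positive class: sensitivity is the true positive rate, G-mean is the geometric mean of sensitivity and specificity, and F-measure is the harmonic mean of precision and sensitivity. The function name `imbalance_metrics` and the confusion-matrix counts are hypothetical and chosen only for illustration; they are not taken from the article.

```python
import math


def imbalance_metrics(tp, fn, fp, tn):
    """Return sensitivity, G-mean and F-measure from confusion-matrix counts.

    Assumption: the minority class is the positive class, and the standard
    definitions of the measures apply. A ratio with a zero denominator is
    reported as 0.0.
    """
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0  # true positive rate
    specificity = tn / (tn + fp) if (tn + fp) else 0.0  # true negative rate
    precision = tp / (tp + fp) if (tp + fp) else 0.0

    # G-mean is high only when both classes are classified well.
    g_mean = math.sqrt(sensitivity * specificity)

    # F-measure: harmonic mean of precision and sensitivity.
    denom = precision + sensitivity
    f_measure = 2 * precision * sensitivity / denom if denom else 0.0

    return {
        "sensitivity": sensitivity,
        "g_mean": g_mean,
        "f_measure": f_measure,
    }


# Hypothetical counts: 10 minority and 90 majority test instances.
m = imbalance_metrics(tp=8, fn=2, fp=10, tn=80)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

With these made-up counts, plain accuracy would be 88%, while the F-measure is much lower because more than half of the predicted minority instances are false alarms. This is the usual reason such measures are preferred over accuracy for imbalanced data.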

Author information

Corresponding author

Correspondence to Sana Saeed.

About this article

Cite this article

Saeed, S., Ong, H.C. A bi-objective hybrid algorithm for the classification of imbalanced noisy and borderline data sets. Pattern Anal Applic 22, 979–998 (2019). https://doi.org/10.1007/s10044-018-0693-4

  • DOI: https://doi.org/10.1007/s10044-018-0693-4
