A hybrid approach for classification of rare class data

  • Regular Paper
Knowledge and Information Systems

Abstract

Learning from rare class data is a challenging problem in classification. Rare class, or imbalanced class, learning is a common problem in many real-world applications, and much research has therefore focused on it. Classifiers trained on rare class data often give misleading results, because performance on the majority class overwhelms the accuracy on the minority class. Many methods have been proposed to handle the imbalanced class (also called rare class or skewed class) problem. This paper proposes a hybrid method, i.e. a classification- and clustering-based method, for solving the rare class problem. The proposed hybrid method uses k-means, ensemble, and divide-and-merge methods, and aims to improve the detection rate of every class. The proposed method is tested on real datasets. The experimental results show that the proposed method works well compared with other algorithms.
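To illustrate why the abstract targets the detection rate of every class rather than overall accuracy, the following minimal sketch compares the two measures on a toy imbalanced problem. It is not the authors' method: the function names, the class labels, and the 95/5 split are hypothetical, and "detection rate" is assumed here to mean the fraction of a class's samples that are predicted correctly.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of all samples predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_detection_rate(y_true, y_pred):
    """Map each class to the fraction of its samples predicted correctly.

    Assumption: "detection rate" is read as per-class recall.
    """
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    # Counter returns 0 for classes that were never predicted correctly.
    return {c: hits[c] / totals[c] for c in totals}


# Hypothetical toy data: 95 majority-class and 5 rare-class samples.
y_true = ["normal"] * 95 + ["rare"] * 5
# A trivial classifier that always predicts the majority class.
y_pred = ["normal"] * 100

overall = accuracy(y_true, y_pred)
rates = per_class_detection_rate(y_true, y_pred)

print(overall)  # 0.95
print(rates)    # {'normal': 1.0, 'rare': 0.0}
```

In this toy case the overall accuracy is high even though no rare-class sample is detected, which is the kind of masking effect that per-class evaluation exposes.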



Acknowledgements

We are thankful to the reviewers for their valuable comments and suggestions, which helped us in the revision of this paper. We are also thankful to the editor and his team for their support and guidance. Finally, we are thankful to the late Dr. Ravindra C. Thool for kind support, guidance and motivation during the research and at every stage of life.

Author information

Corresponding author

Correspondence to Kapil Keshao Wankhade.

About this article

Cite this article

Wankhade, K.K., Jondhale, K.C. & Thool, V.R. A hybrid approach for classification of rare class data. Knowl Inf Syst 56, 197–221 (2018). https://doi.org/10.1007/s10115-017-1114-5

  • DOI: https://doi.org/10.1007/s10115-017-1114-5
