Abstract
Learning from rare class data is a challenging problem in classification. Rare class, or imbalanced class, learning is a common problem in many real-world applications, and many researchers have therefore focused on it. Rare class data often produce misleading results because the majority class overwhelms the minority class in the overall accuracy. Many methods have been proposed to handle the imbalanced class, rare class, or skewed class problem. This paper proposes a hybrid method, i.e., a classification- and clustering-based method, for solving the rare class problem. The proposed hybrid method uses k-means, ensemble, and divide-and-merge methods, and it tries to improve the detection rate of every class. The proposed method is tested on real datasets. The experimental results show that the proposed method works well compared with other algorithms.
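The point that overall accuracy can hide poor performance on the rare class can be illustrated with a minimal sketch. The example below is not the proposed hybrid method. It only computes the per-class detection rate (the fraction of each class's samples that are predicted correctly) next to overall accuracy. The function names, class labels, and toy data are hypothetical and chosen for illustration.

```python
from collections import Counter


def detection_rates(y_true, y_pred):
    """Return {class: fraction of that class's samples predicted correctly}."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {c: hits[c] / totals[c] for c in totals}


def accuracy(y_true, y_pred):
    """Return the fraction of all samples predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical toy data: 95 majority ("normal") and 5 rare ("attack") samples.
y_true = ["normal"] * 95 + ["attack"] * 5
# A trivial classifier that always predicts the majority class.
y_pred = ["normal"] * 100

overall = accuracy(y_true, y_pred)
rates = detection_rates(y_true, y_pred)
```

Here the always-majority classifier reaches a high overall accuracy while detecting none of the rare class samples, which is why a method for rare class data is better judged by the detection rate of every class.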
Acknowledgements
We are thankful to the reviewers for their valuable comments and suggestions, which helped us in the revision of this paper. We are also thankful to the editor and his team for their support and guidance. Finally, we are thankful to the late Dr. Ravindra C. Thool for kind support, guidance, and motivation during the research and at every stage of life.
Cite this article
Wankhade, K.K., Jondhale, K.C. & Thool, V.R. A hybrid approach for classification of rare class data. Knowl Inf Syst 56, 197–221 (2018). https://doi.org/10.1007/s10115-017-1114-5
DOI: https://doi.org/10.1007/s10115-017-1114-5