A design of information granule-based under-sampling method in imbalanced data classification

  • Methodologies and Application
  • Published: 2020

Soft Computing

Abstract

In numerous real-world problems, we are faced with difficulties in learning from imbalanced data. The classification performance of a “standard” classifier (learning algorithm) is evidently hindered by the imbalanced distribution of data. Over-sampling and under-sampling methods have been researched extensively with the aim of increasing the prediction accuracy over the minority class. However, traditional under-sampling methods tend to ignore important characteristics of the majority class. In this paper, a novel under-sampling method based on information granules is proposed. The method exploits the concepts and algorithms of granular computing. First, information granules are built around selected patterns coming from the majority class to capture the essence of the data belonging to this class. Subsequently, the resultant information granules are evaluated in terms of their quality, and those with the highest specificity values are selected. Next, the selected numeric data are augmented with weights implied by the sizes of the information granules. Finally, a support vector machine and a K-nearest-neighbor classifier, both regarded here as representative classifiers, are built on the weighted data. Experimental studies are carried out using synthetic data as well as a suite of imbalanced data sets coming from public machine learning repositories. The experimental results quantify the performance of the support vector machine and K-nearest-neighbor classifiers combined with the information granule-based under-sampling method and demonstrate their superiority over the same classifiers endowed with a conventional under-sampling method. In general, the improvement in performance, expressed in terms of G-means, exceeds 10% when applying information granule under-sampling compared with random under-sampling.
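The pipeline described in the abstract can be sketched in code. Note that this is a minimal illustration under stated assumptions, not the paper's exact formulation: the function names are invented here, the granules are taken to be Euclidean hyperballs around randomly chosen majority-class prototypes, and each granule's radius is chosen by maximizing a coverage × specificity product in the spirit of the principle of justifiable granularity. The granule's size (number of majority points it covers) then serves as the weight attached to its prototype, and G-means is the evaluation metric mentioned in the abstract.

```python
import numpy as np

def justifiable_radius(center, points, radii):
    """Pick the radius maximizing coverage * specificity (an assumed criterion).

    Coverage is the fraction of majority points inside the ball; specificity
    decays with radius, so the product trades off generality vs. precision.
    """
    best_r, best_v = radii[0], -np.inf
    dists = np.linalg.norm(points - center, axis=1)
    for r in radii:
        coverage = np.mean(dists <= r)
        specificity = np.exp(-r)
        v = coverage * specificity
        if v > best_v:
            best_v, best_r = v, r
    return best_r, best_v

def granular_undersample(X_maj, n_granules=5, rng=None):
    """Replace the majority class by a few granule prototypes plus weights.

    Prototype selection is random here for simplicity; weights are the
    granule sizes (point counts), to be passed to a weight-aware classifier.
    """
    rng = np.random.default_rng(rng)
    idx = rng.choice(len(X_maj), size=n_granules, replace=False)
    radii = np.linspace(0.1, 2.0, 20)  # illustrative candidate radii
    reps, weights = [], []
    for i in idx:
        center = X_maj[i]
        r, _ = justifiable_radius(center, X_maj, radii)
        inside = np.linalg.norm(X_maj - center, axis=1) <= r
        reps.append(center)
        weights.append(inside.sum())  # weight proportional to granule size
    return np.asarray(reps), np.asarray(weights, dtype=float)

def g_mean(tp, fn, tn, fp):
    """Geometric mean of sensitivity and specificity (the G-means metric)."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return np.sqrt(sensitivity * specificity)
```

The returned prototypes and weights could then be fed to any classifier that accepts per-sample weights (e.g. an SVM trained with `sample_weight`), alongside the untouched minority class.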


Acknowledgements

This work was supported by the National Natural Science Foundation of China under Grant Nos. 61472295, 61672400, the Recruitment Program of Global Experts, Canada Research Chair (CRC), Natural Sciences and Engineering Research Council of Canada (NSERC) and the Science and Technology Development Fund, MSAR, under Grant No. 0012/2019/A3, the National Key R&D Program of China under Grant 2018YFB1700104, and Guangxi Key Laboratory of Trusted Software under Grant No. kx201926.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Xiubin Zhu.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest. The work described is original research that has not been published previously and is not under consideration for publication elsewhere, in whole or in part. All listed authors have approved the enclosed manuscript.

Additional information

Communicated by V. Loia.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Liu, T., Zhu, X., Pedrycz, W. et al. A design of information granule-based under-sampling method in imbalanced data classification. Soft Comput 24, 17333–17347 (2020). https://doi.org/10.1007/s00500-020-05023-2
