
A dissimilarity-based imbalance data classification algorithm

Applied Intelligence

Abstract

Class imbalance has been reported to compromise the performance of most standard classifiers, such as Naive Bayes, Decision Trees and Neural Networks. To address this problem, various solutions have been explored, mainly by balancing the skewed class distribution or by improving existing classification algorithms. However, these methods concentrate on the imbalanced distribution itself and ignore the discriminative ability of features in the presence of class imbalance. From this perspective, we propose a dissimilarity-based method for classifying imbalanced data. The method first removes useless and redundant features from the given data set via feature selection; it then extracts representative instances from the reduced data as prototypes; finally, it projects the reduced data into a dissimilarity space by constructing new features, and builds the classification model on the data in that space. Extensive experiments on 24 benchmark class-imbalance data sets show that, compared with seven other solutions for handling imbalanced data, our method greatly improves the performance of imbalance learning and outperforms the other solutions under all of the given classification algorithms.
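The pipeline described above has three steps: feature selection, prototype extraction, and projection into a dissimilarity space whose new features are the distances from each instance to the prototypes. Below is a minimal sketch of such a pipeline in Python with scikit-learn; the ANOVA-based feature filter, the k-means-derived prototypes, the Euclidean dissimilarity measure, and the decision-tree base classifier are all illustrative assumptions, not the specific procedures used in the paper.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.metrics import pairwise_distances
    from sklearn.tree import DecisionTreeClassifier

    def fit_dissimilarity_model(X, y, n_features=10, prototypes_per_class=5):
        # Step 1: remove useless/redundant features (an ANOVA F-score filter
        # here; the paper's actual feature-selection criterion may differ).
        selector = SelectKBest(f_classif, k=min(n_features, X.shape[1])).fit(X, y)
        X_red = selector.transform(X)

        # Step 2: extract representative instances as prototypes. Taking
        # k-means centroids per class gives minority and majority classes
        # equal representation; this is one plausible choice, not the paper's.
        protos = []
        for label in np.unique(y):
            Xc = X_red[y == label]
            k = min(prototypes_per_class, len(Xc))
            km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(Xc)
            protos.append(km.cluster_centers_)
        protos = np.vstack(protos)

        # Step 3: project into the dissimilarity space. Each new feature is
        # the (Euclidean) distance to one prototype; any base classifier can
        # then be trained on this representation.
        D = pairwise_distances(X_red, protos)
        model = DecisionTreeClassifier(random_state=0).fit(D, y)
        return selector, protos, model

    def predict(selector, protos, model, X_new):
        # Apply the same reduction and projection before classifying.
        D_new = pairwise_distances(selector.transform(X_new), protos)
        return model.predict(D_new)

A key property of this construction is that the base classifier never sees the original features: it operates on distances to prototypes, so any standard learner can be plugged into the final step unchanged.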



Acknowledgements

This work was supported by the National Natural Science Foundation of China under Grants 61373046 and 61210004.

Author information


Correspondence to Xueying Zhang or Qinbao Song.


About this article


Cite this article

Zhang, X., Song, Q., Wang, G. et al. A dissimilarity-based imbalance data classification algorithm. Appl Intell 42, 544–565 (2015). https://doi.org/10.1007/s10489-014-0610-5

