An effective distance based feature selection approach for imbalanced data

Applied Intelligence

Abstract

Class imbalance is one of the critical problems in classification, and it becomes more severe when the data set has a large number of features. Traditional classifiers generally favour the majority class because of the skewed class distribution. In recent years, feature selection has been used to select appropriate features for better classification of the minority class. However, these studies are limited to the imbalance that arises between classes. Within-class imbalance, occurring together with between-class imbalance and a large number of features, adds further complexity and results in poor classifier performance. In the current study, we propose an effective distance based feature selection method (ED-Relief) that uses a sophisticated distance measure to tackle the simultaneous occurrence of between-class and within-class imbalance. The method has been tested on a variety of simulated experiments and real-life data sets, and the results are compared with the traditional Relief method and some well-known recent distance based feature selection methods. The results show that the proposed effective distance based feature selection method outperforms these methods.
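For readers unfamiliar with the baseline named in the abstract, the following is a minimal sketch of a traditional Relief-style weight update, not of ED-Relief itself, whose effective distance measure is not specified in the abstract. The function name `relief_weights`, the toy data, the use of range-scaled absolute differences with a Manhattan distance, and the choice to visit every instance once instead of sampling at random are all assumptions made for illustration.

```python
def relief_weights(X, y):
    """Relief-style feature weights for a two-class problem.

    Illustrative baseline only: every instance is visited once
    (classic Relief samples instances at random), and feature
    differences are absolute values scaled by the feature range.
    """
    n, d = len(X), len(X[0])

    # Range of each feature, used to put all differences on a 0..1 scale.
    ranges = []
    for j in range(d):
        column = [row[j] for row in X]
        ranges.append((max(column) - min(column)) or 1.0)

    def diff(a, b, j):
        return abs(a[j] - b[j]) / ranges[j]

    def dist(a, b):
        # Manhattan distance over the scaled features.
        return sum(diff(a, b, j) for j in range(d))

    weights = [0.0] * d
    for i in range(n):
        hits = [k for k in range(n) if k != i and y[k] == y[i]]
        misses = [k for k in range(n) if y[k] != y[i]]
        if not hits or not misses:
            continue  # no same-class or other-class neighbour available
        near_hit = min(hits, key=lambda k: dist(X[i], X[k]))
        near_miss = min(misses, key=lambda k: dist(X[i], X[k]))
        for j in range(d):
            # Reward features that differ on the nearest miss,
            # penalise features that differ on the nearest hit.
            weights[j] += (diff(X[i], X[near_miss], j)
                           - diff(X[i], X[near_hit], j)) / n
    return weights


# Hypothetical imbalanced toy data: six majority and two minority points.
# Feature 0 separates the classes; feature 1 is noise.
X = [[0.0, 0.3], [0.1, 0.9], [0.2, 0.1], [0.15, 0.6],
     [0.05, 0.5], [0.25, 0.8], [0.9, 0.4], [1.0, 0.7]]
y = [0, 0, 0, 0, 0, 0, 1, 1]

weights = relief_weights(X, y)
ranking = sorted(range(len(weights)), key=lambda j: weights[j], reverse=True)
print(ranking)
```

On this toy data the separating feature receives the larger weight and is ranked first. Because the nearest hit and nearest miss are found with a plain distance, minority instances and small sub-clusters contribute few updates, which is the kind of limitation under between-class and within-class imbalance that the abstract says ED-Relief targets.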




Corresponding author

Correspondence to Usha Ananthakumar.


Cite this article

Shahee, S.A., Ananthakumar, U. An effective distance based feature selection approach for imbalanced data. Appl Intell 50, 717–745 (2020). https://doi.org/10.1007/s10489-019-01543-z


  • DOI: https://doi.org/10.1007/s10489-019-01543-z
