LDAMSS: Fast and efficient undersampling method for imbalanced learning

Applied Intelligence

Abstract

This article proposes a novel undersampling method, LDAMSS, based on linear discriminant analysis (LDA) and Markov selective sampling (MSS). The method has two stages. The first stage repeatedly adjusts the position of the classification boundary according to the G-mean of the LDA classifier. The second stage uses MSS to extract the “important” training samples from the current majority class. We apply the proposed undersampling method to XGBoost and study its learning performance. Experimental results on binary-class datasets show that, compared with other methods, XGBoost based on LDAMSS (X-LDAMSS) not only performs better on three metrics (F-measure, G-mean, and AUC) but also requires less total time. We also apply X-LDAMSS to multi-class classification problems and present some useful discussions.
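To make the first stage concrete, the following is a minimal sketch of choosing a decision threshold by G-mean. G-mean is the geometric mean of the recall on the minority class and the recall on the majority class. The abstract does not say how the boundary is moved, so everything here is an assumption for illustration: the scores stand in for the output of a linear classifier such as LDA, the helper names `g_mean` and `best_threshold` are hypothetical, and the search simply tries each observed score as a cut-off. It is not the authors' procedure, and the MSS stage is not sketched because the abstract gives no acceptance rule for it.

```python
from math import sqrt


def g_mean(y_true, y_pred):
    """Geometric mean of minority recall (sensitivity) and majority recall (specificity)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return sqrt(sensitivity * specificity)


def best_threshold(scores, y_true):
    """Hypothetical helper: try each observed score as a cut-off, keep the best G-mean."""
    best_thr, best_g = None, -1.0
    for thr in sorted(set(scores)):
        y_pred = [1 if s >= thr else 0 for s in scores]
        g = g_mean(y_true, y_pred)
        if g > best_g:
            best_thr, best_g = thr, g
    return best_thr, best_g


# Toy data (assumed): 8 majority samples (label 0) and 2 minority samples (label 1).
scores = [-2.0, -1.5, -1.2, -1.0, -0.8, -0.5, -0.3, 0.4, -0.1, 0.9]
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

# G-mean at the conventional cut-off of 0 versus the G-mean-optimal cut-off.
default_pred = [1 if s >= 0.0 else 0 for s in scores]
default_g = g_mean(labels, default_pred)
thr, tuned_g = best_threshold(scores, labels)
print(f"default G-mean={default_g:.3f}, tuned threshold={thr}, tuned G-mean={tuned_g:.3f}")
```

On this toy data, moving the cut-off toward the majority side recovers the missed minority sample without adding false positives, which raises G-mean. This is the kind of trade-off a G-mean-driven boundary adjustment targets.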

Funding

This work is supported in part by NSFC grant 61772011, the Open Project Foundation of Intelligent Information Processing Key Laboratory of Shanxi Province (No. CICIP2018002), and the National Key Research and Development Program of China (No. 2020YFA0714200). Corresponding authors: Bin Zou and Jie Xu.

Author information

Corresponding authors

Correspondence to Jie Xu or Bin Zou.

About this article

Cite this article

Liang, T., Xu, J., Zou, B. et al. LDAMSS: Fast and efficient undersampling method for imbalanced learning. Appl Intell 52, 6794–6811 (2022). https://doi.org/10.1007/s10489-021-02780-x

  • DOI: https://doi.org/10.1007/s10489-021-02780-x
