Abstract
In this article, a novel undersampling method based on linear discriminant analysis (LDA) and Markov selective sampling (MSS) is proposed. The method comprises two stages. In the first stage, the position of the classification boundary is adjusted iteratively according to the G-mean of an LDA classifier. In the second stage, MSS extracts the “important” training samples from the current majority class. We apply the proposed undersampling method to XGBoost and study its learning performance. Experimental results on binary-class datasets show that, compared with other methods, XGBoost based on LDAMSS (X-LDAMSS) not only performs better on three metrics (F-measure, G-mean, and AUC) but also requires less total running time. We also apply X-LDAMSS to multi-class classification problems and present some useful discussions.
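The two ingredients named above can be sketched in a few lines. This is an illustrative toy, not the authors' implementation: the G-mean formula is standard, but the `markov_select` routine below is a hypothetical Metropolis-style stand-in for Markov selective sampling, where a sample's "importance" is approximated by its closeness to the decision boundary (signed-score magnitude).

```python
import numpy as np

def g_mean(y_true, y_pred):
    """Geometric mean of sensitivity and specificity for binary labels {0, 1}."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    pos = np.sum(y_true == 1)
    neg = np.sum(y_true == 0)
    sens = tp / pos if pos else 0.0
    spec = tn / neg if neg else 0.0
    return np.sqrt(sens * spec)

def markov_select(scores, k, rng=None):
    """Toy Markov selective sampling over majority-class indices.

    `scores` are signed distances to the classification boundary; samples
    close to the boundary are treated as more 'important'. A random walk
    proposes candidates and accepts them with a Metropolis-style ratio
    of importances, collecting k distinct indices.
    """
    rng = np.random.default_rng(rng)
    n = len(scores)
    importance = 1.0 / (1.0 + np.abs(scores))  # near-boundary -> high weight
    current = int(rng.integers(n))
    chosen = set()
    while len(chosen) < min(k, n):
        cand = int(rng.integers(n))
        # accept with probability min(1, importance ratio)
        if rng.random() < min(1.0, importance[cand] / importance[current]):
            current = cand
            chosen.add(cand)
    return np.array(sorted(chosen))
```

In the full method, the selected majority-class subset would be combined with all minority samples and passed to an XGBoost classifier; the G-mean would drive the repeated boundary adjustment of stage one.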
Funding
This work is supported in part by NSFC grant 61772011, the Open Project Foundation of the Intelligent Information Processing Key Laboratory of Shanxi Province (No. CICIP2018002), and the National Key Research and Development Program of China (No. 2020YFA0714200). (Corresponding authors: Bin Zou and Jie Xu.)
Cite this article
Liang, T., Xu, J., Zou, B. et al. LDAMSS: Fast and efficient undersampling method for imbalanced learning. Appl Intell 52, 6794–6811 (2022). https://doi.org/10.1007/s10489-021-02780-x