Abstract
In this article, a novel undersampling method based on linear discriminant analysis (LDA) and Markov selective sampling (MSS) is proposed. The method comprises two stages. In the first stage, the position of the classification boundary is adjusted iteratively according to the G-mean of an LDA classifier. In the second stage, MSS extracts the “important” training samples from the current majority class. We apply the proposed undersampling method to XGBoost and study its learning performance. Experimental results on binary-class datasets show that, compared with other methods, XGBoost based on LDAMSS (X-LDAMSS) not only performs better on three metrics (F-measure, G-mean, and AUC) but also requires less total running time. We also apply X-LDAMSS to multi-class classification problems and present some useful discussions.
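The two ingredients named above can be sketched in a few lines. This is an illustrative toy, not the authors' implementation: the G-mean formula is standard, but the `markov_select` routine below is a hypothetical Metropolis-style stand-in for Markov selective sampling, where a sample's "importance" is approximated by its closeness to the decision boundary (signed-score magnitude).

```python
import numpy as np

def g_mean(y_true, y_pred):
    """Geometric mean of sensitivity and specificity for binary labels {0, 1}."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    pos = np.sum(y_true == 1)
    neg = np.sum(y_true == 0)
    sens = tp / pos if pos else 0.0
    spec = tn / neg if neg else 0.0
    return np.sqrt(sens * spec)

def markov_select(scores, k, rng=None):
    """Toy Markov selective sampling over majority-class indices.

    `scores` are signed distances to the classification boundary; samples
    close to the boundary are treated as more 'important'. A random walk
    proposes candidates and accepts them with a Metropolis-style ratio
    of importances, collecting k distinct indices.
    """
    rng = np.random.default_rng(rng)
    n = len(scores)
    importance = 1.0 / (1.0 + np.abs(scores))  # near-boundary -> high weight
    current = int(rng.integers(n))
    chosen = set()
    while len(chosen) < min(k, n):
        cand = int(rng.integers(n))
        # accept with probability min(1, importance ratio)
        if rng.random() < min(1.0, importance[cand] / importance[current]):
            current = cand
            chosen.add(cand)
    return np.array(sorted(chosen))
```

In the full method, the selected majority-class subset would be combined with all minority samples and passed to an XGBoost classifier; the G-mean would drive the repeated boundary adjustment of stage one.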
Funding
This work is supported in part by NSFC grant 61772011, the Open Project Foundation of the Intelligent Information Processing Key Laboratory of Shanxi Province (No. CICIP2018002), and the National Key Research and Development Program of China (No. 2020YFA0714200). (Corresponding authors: Bin Zou and Jie Xu.)
Cite this article
Liang, T., Xu, J., Zou, B. et al. LDAMSS: Fast and efficient undersampling method for imbalanced learning. Appl Intell 52, 6794–6811 (2022). https://doi.org/10.1007/s10489-021-02780-x