A Gaussian mixture model based combined resampling algorithm for classification of imbalanced credit data sets

Han, Xu; Cui, Runbang; Lan, Yanfei; Kang, Yanzhe; Deng, Jiang; Jia, Ning

doi:10.1007/s13042-019-00953-2

Original Article
Published: 08 May 2019

Volume 10, pages 3687–3699, (2019)
International Journal of Machine Learning and Cybernetics

Xu Han¹,
Runbang Cui²,
Yanfei Lan¹,
Yanzhe Kang¹,
Jiang Deng² &
…
Ning Jia¹

Abstract

Credit scoring is a binary classification problem. Credit data sets are frequently imbalanced: one class contains few samples while the other contains many. As a result, classification performance suffers when a traditional classifier is applied directly to such data. To improve the classification of credit data sets, a Gaussian mixture model based combined resampling algorithm is proposed. The approach first uses a sampling factor to determine the number of samples for the majority class and the minority class. Gaussian mixture clustering is then used to undersample the majority class, and the synthetic minority oversampling technique (SMOTE) is used to oversample the minority class, so that the imbalance is eliminated. The proposed method is compared with several resampling methods commonly used in the analysis of imbalanced credit data sets. The experimental results demonstrate that it consistently improves classification performance as measured by F-measure, AUC, G-mean, and other metrics. The method is also robust across credit data sets.
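The combined resampling idea can be illustrated with a minimal sketch. The abstract does not specify how the sampling factor maps to class sizes or how clusters guide the undersampling, so everything below is an assumption made for illustration only: the `target_size` formula, the diagonal-covariance EM fit, the proportional per-cluster random selection, and all function and parameter names are hypothetical and are not the authors' implementation. The sketch uses only the Python standard library.

```python
import math
import random

def target_size(n_major, n_minor, factor):
    """Common class size after resampling (assumed formula, factor in [0, 1])."""
    return n_minor + round(factor * (n_major - n_minor))

def gmm_labels(points, k, iters=50, seed=0):
    """Fit a k-component diagonal-covariance Gaussian mixture with EM and
    return, for each point, the index of its most responsible component."""
    rng = random.Random(seed)
    dim = len(points[0])
    means = [list(p) for p in rng.sample(points, k)]
    variances = [[1.0] * dim for _ in range(k)]
    weights = [1.0 / k] * k
    resp = []
    for _ in range(iters):
        # E-step: responsibilities, computed in log space for stability.
        resp = []
        for p in points:
            logs = [math.log(weights[j]) + sum(
                        -0.5 * math.log(2 * math.pi * variances[j][d])
                        - (p[d] - means[j][d]) ** 2 / (2 * variances[j][d])
                        for d in range(dim))
                    for j in range(k)]
            top = max(logs)
            probs = [math.exp(v - top) for v in logs]
            total = sum(probs)
            resp.append([v / total for v in probs])
        # M-step: update weights, means, and variances (with a small floor).
        for j in range(k):
            nj = sum(r[j] for r in resp) + 1e-12
            weights[j] = nj / len(points)
            for d in range(dim):
                means[j][d] = sum(r[j] * p[d] for r, p in zip(resp, points)) / nj
                variances[j][d] = sum(r[j] * (p[d] - means[j][d]) ** 2
                                      for r, p in zip(resp, points)) / nj + 1e-6
    return [max(range(k), key=lambda j: r[j]) for r in resp]

def gmm_undersample(major, n_keep, k=2, seed=0):
    """Keep n_keep majority samples, drawn from each cluster in proportion
    to the cluster's size (an assumed selection rule)."""
    rng = random.Random(seed)
    labels = gmm_labels(major, k, seed=seed)
    kept, spare = [], []
    for j in range(k):
        members = [p for p, c in zip(major, labels) if c == j]
        rng.shuffle(members)
        quota = n_keep * len(members) // len(major)
        kept.extend(members[:quota])
        spare.extend(members[quota:])
    rng.shuffle(spare)  # fill any rounding shortfall from the leftovers
    return kept + spare[:n_keep - len(kept)]

def smote(minor, n_new, k_neighbors=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minor)
        neighbours = sorted((p for p in minor if p is not base),
                            key=lambda p: math.dist(p, base))[:k_neighbors]
        mate = rng.choice(neighbours)
        gap = rng.random()
        synthetic.append([b + gap * (m - b) for b, m in zip(base, mate)])
    return synthetic

def combined_resample(major, minor, factor, k=2, seed=0):
    """Undersample the majority class and oversample the minority class
    to the same size, chosen by the sampling factor."""
    size = target_size(len(major), len(minor), factor)
    new_major = gmm_undersample(major, size, k, seed)
    new_minor = minor + smote(minor, size - len(minor), seed=seed)
    return new_major, new_minor

# Toy data: 100 majority points in two blobs, 10 minority points.
rng = random.Random(42)
major = ([[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(60)]
         + [[rng.gauss(5, 1), rng.gauss(5, 1)] for _ in range(40)])
minor = [[rng.gauss(2.5, 0.5), rng.gauss(2.5, 0.5)] for _ in range(10)]
new_major, new_minor = combined_resample(major, minor, factor=0.5)
print(len(new_major), len(new_minor))  # prints: 55 55
```

With a factor of 0.5 on 100 majority and 10 minority samples, both classes end up with 55 samples under the assumed formula: the majority class loses 45 samples and the minority class gains 45 synthetic ones. Drawing from every mixture component in proportion to its size is one plausible way to preserve the shape of the majority distribution while shrinking it.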

Acknowledgements

The authors are grateful for the support of the National Natural Science Foundation of China (Grants 71671123 and 71571132). The authors also thank the enterprises and professors who helped during this work.

Author information

Authors and Affiliations

College of Management and Economics, Tianjin University, Tianjin, 300072, China
Xu Han, Yanfei Lan, Yanzhe Kang & Ning Jia
QingDao Fantaike Technology Co., Ltd, Qingdao, China
Runbang Cui & Jiang Deng

Corresponding author

Correspondence to Ning Jia.

About this article

Cite this article

Han, X., Cui, R., Lan, Y. et al. A Gaussian mixture model based combined resampling algorithm for classification of imbalanced credit data sets. Int. J. Mach. Learn. & Cyber. 10, 3687–3699 (2019). https://doi.org/10.1007/s13042-019-00953-2

Received: 19 April 2018
Accepted: 23 April 2019
Published: 08 May 2019
Issue Date: December 2019
DOI: https://doi.org/10.1007/s13042-019-00953-2
