ABSTRACT
Effective credit risk prediction is critical for commercial banks to actively manage their lending books and reduce the negative impact of potential credit losses. In benign credit markets where default rates are low, the datasets used to predict customers' credit default status are often imbalanced. While many studies have addressed the class imbalance problem in the machine learning field under various contexts, our research focuses on the credit risk sector. We experiment with five resampling strategies (SMOTE, Borderline-SMOTE, Random Undersampling, NearMiss, SMOTE + Tomek Links) and a thresholding technique (Optimal G-Mean) on eight different classifiers separately. We also explore whether combining resampling and thresholding can enhance overall classifier performance. Model performance is evaluated on both threshold metrics (G-Mean, F1-score) and ranking metrics (area under the receiver operating characteristic (ROC) curve). Our findings suggest that pure threshold tuning can often outperform resampling methods, whereas the effect of thresholding applied on top of a resampled dataset is minor.
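The Optimal G-Mean thresholding mentioned above can be sketched as follows. This is a minimal illustration, not the paper's actual experimental setup: it uses a synthetic imbalanced dataset and a logistic regression classifier (both assumptions for demonstration) and selects the decision threshold that maximizes G-Mean = sqrt(TPR × (1 − FPR)) over the ROC curve's candidate thresholds.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data as a stand-in for a credit default dataset:
# ~5% positives (defaults), ~95% negatives (non-defaults).
X, y = make_classification(
    n_samples=5000, weights=[0.95, 0.05], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, stratify=y, random_state=42
)

# Fit any probabilistic classifier; logistic regression is used here
# purely for illustration.
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
probs = clf.predict_proba(X_te)[:, 1]

# G-Mean = sqrt(sensitivity * specificity) = sqrt(TPR * (1 - FPR)).
# roc_curve yields one (FPR, TPR) pair per candidate threshold, so the
# optimal G-Mean threshold is found by a simple argmax over that grid.
fpr, tpr, thresholds = roc_curve(y_te, probs)
gmeans = np.sqrt(tpr * (1 - fpr))
best = int(np.argmax(gmeans))
best_threshold = thresholds[best]

# Classify with the tuned threshold instead of the default 0.5.
y_pred = (probs >= best_threshold).astype(int)
```

On imbalanced data the tuned threshold typically falls well below 0.5, trading some specificity for a large gain in sensitivity on the minority (default) class, which is what drives the G-Mean improvement reported for pure threshold tuning.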