Solving Imbalanced Data in Credit Risk Prediction: A Comparison of Resampling Strategies for Different Machine Learning Classification Algorithms, Taking Threshold Tuning into Account

Authors:
Chenyu Yang, The University of Hong Kong, China (ORCID: 0000-0003-2011-9757)
Yanjie Dong, Shanghai United International School, China (ORCID: 0000-0002-5941-4596)
Jiachen Lu, Culver Academy, USA (ORCID: 0000-0003-4149-6666)
Zherui Peng, Wuhan Britain China School, China (ORCID: 0000-0002-6713-8683)

MLMI '22: Proceedings of the 2022 5th International Conference on Machine Learning and Machine Intelligence, September 2022, Pages 30–40, https://doi.org/10.1145/3568199.3568204

Published: 06 March 2023

ABSTRACT

Effective credit risk prediction is critical for commercial banks to actively manage their lending books and reduce the negative impact of potential credit losses. In benign credit markets where default rates are low, the datasets used to predict customers' credit default status are often imbalanced. While many studies have addressed the class imbalance problem in machine learning under various contexts, our research focuses on the credit risk sector. We experiment with five resampling strategies (SMOTE, Borderline-SMOTE, Random Undersampling, NearMiss, SMOTE + Tomek Links) and one thresholding technique (Optimal G-Mean), each applied separately to eight different classifiers. We also explore whether combining resampling with thresholding can enhance overall classifier performance. Model performance is evaluated on both threshold metrics (G-Mean, F1-score) and a ranking metric (area under the receiver operating characteristic (ROC) curve). Our findings suggest that pure threshold tuning can often outperform resampling methods, whereas applying thresholding on top of a resampled dataset has only a minor effect.
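As an illustration of the thresholding idea named in the abstract, the following is a minimal sketch of choosing a decision threshold that maximizes the G-Mean (the geometric mean of the true positive rate and the true negative rate). It is not the authors' implementation: the function names `g_mean` and `best_g_mean_threshold`, the toy labels, and the toy scores are all hypothetical, and the search simply tries every distinct predicted score as a cut-off.

```python
import math


def g_mean(y_true, y_pred):
    """Geometric mean of sensitivity (TPR) and specificity (TNR)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    pos = sum(1 for t in y_true if t == 1)
    neg = len(y_true) - pos
    if pos == 0 or neg == 0:
        return 0.0  # G-Mean is undefined without both classes; treat as 0
    return math.sqrt((tp / pos) * (tn / neg))


def best_g_mean_threshold(y_true, scores):
    """Try each distinct score as a cut-off (predict 1 when score >= cut-off)
    and return the cut-off with the highest G-Mean, plus that G-Mean."""
    best_t, best_g = 0.5, -1.0
    for t in sorted(set(scores)):
        preds = [1 if s >= t else 0 for s in scores]
        g = g_mean(y_true, preds)
        if g > best_g:
            best_t, best_g = t, g
    return best_t, best_g


# Hypothetical toy data: 1 = default (minority class), 0 = non-default.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.05, 0.10, 0.12, 0.20, 0.22, 0.30, 0.35, 0.45, 0.40, 0.48]

# With the conventional 0.5 cut-off no default is detected, so G-Mean is 0.
default_preds = [1 if s >= 0.5 else 0 for s in scores]
default_g = g_mean(y_true, default_preds)

threshold, tuned_g = best_g_mean_threshold(y_true, scores)
print(f"G-Mean at 0.5: {default_g:.3f}")
print(f"Tuned threshold: {threshold}, G-Mean: {tuned_g:.3f}")
```

On this toy data the classifier's scores for the minority class never reach 0.5, so the default cut-off misses every default, while a lower cut-off recovers both at the cost of one false alarm. In practice the threshold would presumably be selected on a validation split rather than on the data used for final evaluation.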

Index Terms

1. Computing methodologies
  1. Machine learning
    1. Machine learning approaches


Published in

MLMI '22: Proceedings of the 2022 5th International Conference on Machine Learning and Machine Intelligence
September 2022
215 pages
ISBN: 9781450397551
DOI: 10.1145/3568199

Copyright © 2022 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 6 March 2023
Author Tags
Credit Risk
Imbalanced data
Machine Learning
Resampling
Threshold Tuning
Qualifiers
- research-article
- Research
- Refereed limited