DOI: 10.1145/3568199.3568204
research-article

Solving Imbalanced Data in Credit Risk Prediction: A Comparison of Resampling Strategies for Different Machine Learning Classification Algorithms, Taking Threshold Tuning into Account

Published: 06 March 2023

ABSTRACT

Effective credit risk prediction is critical for commercial banks to actively manage their lending book and reduce the negative impact of potential credit losses. In benign credit markets where default rates are low, the datasets used to predict customers' credit default status are often imbalanced. While many studies have addressed the class imbalance problem in machine learning under various contexts, our research focuses on the credit risk sector. We experiment with five resampling strategies (SMOTE, Borderline-SMOTE, Random Undersampling, NearMiss, SMOTE + Tomek Links) and one thresholding technique (Optimal G-Mean), each applied separately to eight different classifiers. We also explore whether combining resampling with thresholding can enhance overall classifier performance. Model performance is evaluated on both threshold metrics (G-Mean, F1-score) and a ranking metric (area under the receiver operating characteristic (ROC) curve). Our findings suggest that pure threshold tuning can often outperform resampling methods, whereas applying thresholding on top of a resampled dataset has only a minor effect.
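
As an illustration of the thresholding idea mentioned in the abstract, the following is a minimal sketch of choosing a decision threshold that maximizes G-Mean, the geometric mean of the true positive rate and the true negative rate. It is not the authors' implementation: the function names `g_mean` and `optimal_gmean_threshold` and the toy labels and scores are hypothetical, and the sketch assumes a binary problem where label 1 marks the minority (default) class and each score is a predicted probability of default.

```python
import math

def g_mean(y_true, y_pred):
    """Geometric mean of sensitivity (TPR) and specificity (TNR)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    tnr = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(tpr * tnr)

def optimal_gmean_threshold(y_true, scores):
    """Scan every observed score as a candidate threshold and
    return the (threshold, G-Mean) pair with the highest G-Mean."""
    best_t, best_g = 0.5, -1.0
    for t in sorted(set(scores)):
        y_pred = [1 if s >= t else 0 for s in scores]
        g = g_mean(y_true, y_pred)
        if g > best_g:  # strict '>' keeps the lowest threshold on ties
            best_t, best_g = t, g
    return best_t, best_g

# Hypothetical toy data: 8 non-defaulters (0) and 2 defaulters (1).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.05, 0.10, 0.12, 0.20, 0.22, 0.30, 0.35, 0.45, 0.40, 0.60]

# G-Mean at the conventional 0.5 cut-off, for comparison.
default_pred = [1 if s >= 0.5 else 0 for s in scores]
default_g = g_mean(y_true, default_pred)

best_t, best_g = optimal_gmean_threshold(y_true, scores)
print(f"G-Mean at 0.5: {default_g:.3f}")
print(f"Best threshold: {best_t:.2f}, G-Mean: {best_g:.3f}")
```

On this toy data, lowering the cut-off below 0.5 recovers the second defaulter at the cost of one false alarm, which raises G-Mean. In practice the threshold would presumably be selected on training or validation scores rather than on the test set, so that the reported metric is not optimistically biased.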

    • Published in

      MLMI '22: Proceedings of the 2022 5th International Conference on Machine Learning and Machine Intelligence
      September 2022
      215 pages
      ISBN: 9781450397551
      DOI: 10.1145/3568199

      Copyright © 2022 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Qualifiers

      • research-article
      • Research
      • Refereed limited