What should lenders be more concerned about? Developing a profit-driven loan default prediction model

https://doi.org/10.1016/j.eswa.2022.118938

Highlights

  • A novel profit-driven model is proposed to predict loan default.

  • Bayesian optimization is used to optimize the hyperparameters of the CBT.

  • Profit metric is taken as the optimization objective of the Bayesian optimization.

  • SHAP value is calculated to provide interpretable prediction results.

Abstract

Reliable and effective loan default risk prediction can help regulators and lenders identify risky loan applicants and develop proactive and timely response measures to enhance the stability of the financial system. Traditional prediction models concentrate on improving loan default prediction accuracy, while neglecting to take profit maximization as the goal and evaluation measure of model construction. In this study, a novel profit-driven prediction model is proposed, taking a profit indicator as the optimization objective of the Bayesian optimization that tunes the hyperparameters of the predictor, categorical boosting (CBT). The Shapley additive explanations (SHAP) value is then calculated to further interpret the relationship between the input variables and the predicted values. Based on two datasets from Renrendai and Lending Club, the experimental results and statistical test indicate that the proposed model achieves the highest profit-related evaluation metrics values, with mean average extra profit rate values of 3.0872% and 2.1858%, respectively, and mean Profit values of 5168.8762 and 352.9787 on the two datasets, respectively. The SHAP values further reveal the key factors that affect the predicted output, which provides more valuable information for platforms and lenders for identifying possible defaulters.

Introduction

Online lending is an innovative form of credit that eliminates the need for financial intermediaries, instead directly releasing the loans through the online lending platform (Ouyang, Zhi, & Wu, 2021). Online lending has seen explosive growth in recent years because of its flexibility, accessibility, and characteristic of decreasing financing costs (Guo, Zhou, Luo, Liu, & Xiong, 2016). The main characteristic of the online lending market is that smaller loan orders are requested relative to traditional loan orders, but with larger numbers of customers. When several borrowers default on their loans, online lending platforms may face cash flow troubles, which become a major threat to their management and operation (Zhang, Wang, Zhang, & Wang, 2020). Unfortunately, due to imperfect information, most online lending platforms perform poorly in credit quality evaluation and loan default risk management. To better manage loan default risk and increase stakeholders’ profits, it is important to assess the default risk of a loan application when platforms provide loans for borrowers.

Generally, loan default risk is defined as the risk of loss due to the borrower's failure to fulfill the loan contract on time (Yuan, Chi, Zhou, & Yin, 2022). Loan default is detrimental to the profits of lenders and the development of the lending platform or financial intermediary. To this end, default risk prediction is widely developed to explore the relationship between the attributes of historical data and the potential default status to effectively identify loan defaulters and provide valuable references for financial institutions, platforms, and lenders (Geng, Bose, & Chen, 2015). However, the two types of prediction error have different consequences: wrongly taking a defaulter as a non-defaulter incurs huge economic losses, whereas classifying a non-defaulter as a defaulter disrupts the pool of high-quality customers (Papouskova & Hajek, 2019). Thus, it is crucial to establish a reliable default prediction model to accurately identify default customers and maximize lenders’ profit.

Specifically, evaluating whether a borrower will default is essentially a binary classification problem. In the past few decades, numerous forecasting technologies applicable to binary classification problems have been proposed for loan default forecasting, such as statistical models and machine learning models. Commonly used statistical models in loan default prediction include logistic regression (LR) (Wiginton, 1980) and discriminant analysis (Khemais, Nesrine, & Mohamed, 2016). Statistical models are popular prediction devices because they can provide accurate and easily interpreted prediction results by virtue of simple functional forms. However, the prediction performance of these models is susceptible to changes in the economic environment and credit market, such as an increase in default rates (Moscatelli, Parlapiano, Narizzano, & Viggiano, 2020). Aside from this, statistical models show poor performance in measuring the nonlinear relationship among economic, financial, and credit variables. A growing body of literature indicates that the complex nonlinear interactions between forecasting models and output can be successfully modeled by machine learning models. Machine learning methods forgo evaluating the importance of influence factors for default risk while presenting accurate out-of-sample prediction results (Baesens et al., 2003). Support vector machine (SVM) (Lv, Wang, Niu, & Lu, 2022), random forest (RF), artificial neural networks (ANNs) (Gao et al., 2022, Wang et al., 2022), and gradient boosting decision trees (GBDTs) are available machine learning methods, among which GBDTs have drawn great attention in the default prediction domain. This is because the high predictive performance of most machine learning methods comes at the expense of interpretability and intelligibility, while GBDTs can fulfill the objective of high transparency and intelligibility as well as provide satisfactory prediction performance.
Light gradient boosting machine (LGBM), extreme gradient boosting (XGBT), and categorical boosting (CBT) are several frequently used GBDTs in loan default prediction. Compared to LGBM and XGBT, CBT involves two significant improvements, namely a powerful scheme for categorical features and unbiased estimation of the gradient step, making it outperform both LGBM and XGBT (Prokhorenkova, Gusev, Vorobev, Dorogush, & Gulin, 2018). Moreover, these two improvements enhance the performance of CBT in handling categorical features, shorten the layout time, and avoid overfitting problems. Thus, CBT is a suitable alternative for predicting default risk. However, the prediction performance of CBT is significantly affected by the setting of hyperparameters. Although CBT was used to predict loan default in some existing literature, the importance of parameter optimization is often ignored or the parameters are mostly determined by grid search or experience (Qi et al., 2021, Tounsi et al., 2020), which may affect the prediction performance. The emergence of optimization algorithms overcomes this drawback (Niu et al., 2022, Wang et al., 2021). Some studies have used Bayesian optimization (BO) to tune the hyperparameters of CBT (Xia et al., 2020, Zhou et al., 2021, Xia et al., 2020). However, the objective function of the BO algorithm in these studies is based on an accuracy indicator, ignoring that the core goal of loan default prediction is to reduce financial loss and maximize the lender's profit.
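To illustrate why the choice of objective matters, the following minimal sketch selects a model configuration once by accuracy and once by profit. Everything in it is an assumption made for illustration: the applicant data are hypothetical, a single rejection threshold stands in for the CBT hyperparameters, an exhaustive search over four candidates stands in for BO, and profit is simplified to "interest earned on repaid loans minus principal lost on defaulted loans", which is not the APR measure used in the paper.

```python
# Hypothetical applicants: (risk_score, principal, interest, defaulted).
APPLICANTS = [
    (0.1, 1000.0, 100.0, False),
    (0.2, 1000.0, 100.0, False),
    (0.3, 5000.0, 500.0, True),   # a large loan that defaults despite a low score
    (0.4, 1000.0, 100.0, False),
    (0.5, 1000.0, 100.0, False),
    (0.6, 1000.0, 100.0, True),
    (0.7, 1000.0, 100.0, True),
]

# Candidate configurations (a stand-in for a hyperparameter search space).
CANDIDATES = [0.25, 0.45, 0.55, 0.65]


def decisions(threshold):
    """Return one flag per applicant; True means 'reject as likely defaulter'."""
    return [score >= threshold for score, _, _, _ in APPLICANTS]


def accuracy(threshold):
    """Share of applicants whose reject/approve decision matches the outcome."""
    pairs = zip(decisions(threshold), APPLICANTS)
    hits = sum(rejected == defaulted for rejected, (_, _, _, defaulted) in pairs)
    return hits / len(APPLICANTS)


def profit(threshold):
    """Simplified profit: approved loans earn interest or lose the principal."""
    total = 0.0
    for rejected, (_, principal, interest, defaulted) in zip(decisions(threshold), APPLICANTS):
        if not rejected:
            total += -principal if defaulted else interest
    return total


best_by_accuracy = max(CANDIDATES, key=accuracy)
best_by_profit = max(CANDIDATES, key=profit)

for t in CANDIDATES:
    print(f"threshold={t:.2f} accuracy={accuracy(t):.3f} profit={profit(t):.1f}")
print("selected by accuracy:", best_by_accuracy)
print("selected by profit:", best_by_profit)
```

On this toy data the two objectives select different configurations, because the accuracy objective treats the large defaulted loan like any other single error while the profit objective weighs it by the amount lost.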

According to the above analysis, some challenges for current loan default prediction technologies exist. First, stakeholders prefer making more profits by differentiating good borrowers from bad ones rather than identifying defaulters more accurately (Ye, Dong, & Ma, 2018), which means that more profit-oriented rather than accuracy-oriented loan default prediction strategies should be explored. Second, most previous researchers have paid more attention to profitability evaluation (Garrido et al., 2018, Verbraken et al., 2014), with few studies incorporating profit maximization into the process of constructing prediction models. Further, no study integrates a profit-based measure and an optimization algorithm into the training of CBT. Inspired by these challenges, this study proposes a profit-driven loan default prediction system consisting of a CBT and BO algorithm (Profit-BOCBT), aiming to provide a new perspective for effective loan default prediction, satisfy the core goal of profit maximization for lenders, help decision makers realize better default risk management, and improve credit market efficiency.

The primary innovation of this study is a new profit-driven loan default prediction model that incorporates a profit-based metric into the guidance of the CBT training process. Specifically, a profit-based measure, namely the average extra profit rate (APR), is set as the learning objective of the BO algorithm to optimize the core hyperparameters of the CBT in the training process. The prediction values are then associated with the sample feature values based on the Shapley additive explanations (SHAP) value to provide interpretable prediction results. The experimental results demonstrate that the proposed model achieves high-quality prediction performance with mean APR values of 3.0872% and 2.1858%, and mean Profit values of 5168.8762 and 352.9787, in datasets 1 and 2, respectively. Thus, the proposed Profit-BOCBT can earn more profits for stakeholders and provide valuable references for lenders and lending institutions.
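The idea behind SHAP, attributing a prediction to its input features through Shapley values, can be sketched on a toy model. The example below is only an illustration under stated assumptions: `toy_model`, its three feature names, and the all-zero baseline are hypothetical, and the brute-force enumeration of feature orderings shown here is feasible only for a handful of features (practical SHAP implementations rely on faster, model-specific algorithms).

```python
from itertools import permutations


def shapley_values(model, x, baseline):
    """Exact Shapley values of model(x) relative to model(baseline).

    A feature that is absent from a coalition takes its baseline value.
    Each feature's contribution is its marginal effect averaged over
    every possible order in which features can be switched on.
    """
    n = len(x)
    phi = [0.0] * n
    orders = list(permutations(range(n)))
    for order in orders:
        current = list(baseline)
        previous = model(current)
        for i in order:
            current[i] = x[i]          # switch feature i to its actual value
            value = model(current)
            phi[i] += value - previous  # marginal contribution in this order
            previous = value
    return [p / len(orders) for p in phi]


def toy_model(features):
    """Hypothetical risk score with an interaction between two features."""
    debt_ratio, late_payments, income = features
    return (0.5 * debt_ratio
            + 0.25 * late_payments
            + 0.5 * debt_ratio * late_payments
            - 0.25 * income)


x = [1.0, 2.0, 1.0]          # hypothetical applicant
baseline = [0.0, 0.0, 0.0]   # hypothetical reference point
phi = shapley_values(toy_model, x, baseline)

for name, contribution in zip(["debt_ratio", "late_payments", "income"], phi):
    print(f"{name}: {contribution:+.4f}")
print("sum of contributions:", sum(phi))
print("prediction minus baseline:", toy_model(x) - toy_model(baseline))
```

The contributions sum to the difference between the prediction and the baseline prediction, which is the additivity property that makes such attributions readable as a decomposition of an individual prediction.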

The remainder of this paper is organized as follows. Section 2 presents the literature review on default prediction methods based on data-driven methods and profit measure. Section 3 introduces the methodologies used in the proposed Profit-BOCBT system. Section 4 shows the experimental setup and corresponding results analysis, and Section 5 concludes the paper.

Literature review

Two themes in the literature, namely data-driven and profit-related approaches to loan default prediction, are reviewed in this section.

Methodology

In this section, we introduce the methodologies used in the proposed loan default prediction system, including CBT, BO algorithm, and prediction performance measures. The detailed flow of the proposed Profit-BOCBT is given in Fig. 1.

Experimental setup and result analysis

In this section, the data description, experimental setup, simulation results, statistical test, interpretability analysis, computational efficiency, and practical suggestions are explained in detail.

Conclusion

This study proposed a novel perspective for profit-driven loan default prediction, that is, using BO to optimize the hyperparameters of CBT, and the optimization objective is innovatively set as a profit indicator (i.e., APR). The prediction performance of the proposed model is compared with ten frequently used prediction models using two datasets from Renrendai and Lending Club in two aspects: accuracy and profit. Moreover, SHAP values of input variables for the proposed model in Dataset 1 are

CRediT authorship contribution statement

Lifang Zhang: Writing – original draft. Jianzhou Wang: Writing – review & editing. Zhenkun Liu: Software, Validation.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgement

This work was supported by Major Program of National Social Science Foundation of China (Grant No. 17ZDA093).

References (68)

  • P.M. Addo et al.

    Credit risk analysis using machine and deep learning models

    Risks

    (2018)
  • M.S. Alam et al.

    Bayesian optimization algorithm based support vector regression analysis for estimation of shear capacity of FRP reinforced concrete members

    Applied Soft Computing

    (2021)
  • B. Baesens et al.

    Benchmarking state-of-the-art classification algorithms for credit scoring

    Journal of the Operational Research Society

    (2003)
  • Barua, S., Gavandi, D., Sangle, P., Shinde, L., & Ramteke, J. (2021). Swindle: Predicting the Probability of Loan...
  • L. Breiman

    Random forests

    Machine Learning

    (2001)
  • L. Breiman et al.

    Classification and Regression Trees (Wadsworth Statistics/Probability)

    (1984)
  • Byanjankar, A., Heikkila, M., & Mezei, J. (2015). Predicting credit risk in peer-to-peer lending: A neural network...
  • S. Chen et al.

    Modeling default risk with support vector machines

    Quantitative Finance

    (2011)
  • T. Chen et al.

    XGBoost: A scalable tree boosting system

  • B.V. Dasarathy

    Nearest Neighbor (NN) Norms: NN Pattern Classification Techniques

    (1991)
  • P. Domingos et al.

    On the optimality of the simple Bayesian classifier under zero-one loss

    Machine Learning

    (1997)
  • R.A. Eisenbeis

    Pitfalls in the application of discriminant analysis in business, finance, and economics

    The Journal of Finance

    (1977)
  • T. Fitzpatrick et al.

    How can lenders prosper? Comparing machine learning approaches to identify profitable peer-to-peer loan investments

    European Journal of Operational Research

    (2021)
  • Y. Gao et al.

    A multi-component hybrid system based on predictability recognition and modified multi-objective optimization for ultra-short-term onshore wind speed forecasting

    Renewable Energy

    (2022)
  • F. Garrido et al.

    A Robust profit measure for binary classification model evaluation

    Expert Systems with Applications

    (2018)
  • R. Geng et al.

    Prediction of financial distress: An empirical study of listed Chinese companies using data mining

    European Journal of Operational Research

    (2015)
  • Y. Guo et al.

    Instance-based credit risk assessment for investment decisions in P2P lending

    European Journal of Operational Research

    (2016)
  • S. Hamori et al.

    Ensemble learning or deep learning? Application to default risk analysis

    Journal of Risk and Financial Management

    (2018)
  • T. Harris

    Credit scoring using the clustered support vector machine

    Expert Systems with Applications

    (2015)
  • H. He et al.

    A novel hybrid ensemble model based on tree-based method and deep learning method for default prediction

    Expert Systems with Applications

    (2021)
  • T. He et al.

    Accelerating multi-layer perceptron based short term demand forecasting using graphics processing units

    Transmission and Distribution Conference and Exposition: Asia and Pacific, T and D Asia

    (2009)
  • G. Huang et al.

    Evaluation of CatBoost method for prediction of reference evapotranspiration in humid regions

    Journal of Hydrology

    (2019)
  • A.A. Ibrahim et al.

    Comparison of the CatBoost classifier with other machine learning methods

    International Journal of Advanced Computer Science and Applications

    (2020)
  • G. Ke et al.

    LightGBM: A highly efficient gradient boosting decision tree

    (2017)
  • Z. Khemais et al.

    Credit scoring and default risk prediction: A comparative study between discriminant analysis & logistic regression

    International Journal of Economics and Finance

    (2016)
  • N. Kozodoi et al.

    A multi-objective approach for profit-driven feature selection in credit scoring

    Decision Support Systems

    (2019)
  • M. Li et al.

    The network loan risk prediction model based on Convolutional neural network and Stacking fusion model

    Applied Soft Computing

    (2021)
  • L. Liang et al.

    Forecasting peer-to-peer platform default rate with LSTM neural network

    Electronic Commerce Research and Applications

    (2020)
  • W. Liu et al.

    Credit scoring based on tree-enhanced gradient boosting decision trees

    Expert Systems with Applications

    (2022)
  • J. López et al.

    Profit-based credit scoring based on robust optimization and feature selection

    Information Sciences

    (2019)
  • M. Lv et al.

    A newly combination model based on data denoising strategy and advanced optimization algorithm for short-term wind speed prediction

    Journal of Ambient Intelligence and Humanized Computing

    (2022)
  • X. Ma et al.

    Study on a prediction of P2P network loan default based on the machine learning LightGBM and XGboost algorithms according to different high dimensional data cleaning

    Electronic Commerce Research and Applications

    (2018)
  • S. Maldonado et al.

    Integrated framework for profit-based feature selection and SVM classification in credit scoring

    Decision Support Systems

    (2017)
  • M. Moscatelli et al.

    Corporate default forecasting with machine learning

    Expert Systems with Applications

    (2020)