What should lenders be more concerned about? Developing a profit-driven loan default prediction model

doi:10.1016/j.eswa.2022.118938

Expert Systems with Applications

Volume 213, Part B, 1 March 2023, 118938

Highlights

• A novel profit-driven model is proposed to predict loan default.
• Bayesian optimization is used to optimize the hyperparameters of the CBT.
• Profit metric is taken as the optimization objective of the Bayesian optimization.
• SHAP value is calculated to provide interpretable prediction results.

Abstract

Reliable and effective loan default risk prediction can help regulators and lenders identify risky loan applicants and develop proactive and timely response measures to enhance the stability of the financial system. Traditional prediction models concentrate on improving loan default prediction accuracy and neglect profit maximization as the goal and evaluation measure of model construction. In this study, a novel profit-driven prediction model is proposed in which a profit indicator serves as the objective of Bayesian optimization when tuning the hyperparameters of the predictor, categorical boosting (CBT). The Shapley additive explanations (SHAP) value is then calculated to interpret the relationship between the input variables and the predicted values. Based on two datasets from Renrendai and Lending Club, the experimental results and statistical tests indicate that the proposed model achieves the highest values of the profit-related evaluation metrics, with mean average extra profit rate values of 3.0872% and 2.1858% and mean Profit values of 5168.8762 and 352.9787 on the two datasets, respectively. The SHAP values further reveal the key factors that affect the predictive output, which gives platforms and lenders more valuable information for identifying possible defaulters.

Introduction

Online lending is an innovative form of credit that eliminates the need for financial intermediaries and instead releases loans directly through an online lending platform (Ouyang, Zhi, & Wu, 2021). Online lending has seen explosive growth in recent years because of its flexibility, accessibility, and lower financing costs (Guo, Zhou, Luo, Liu, & Xiong, 2016). The main characteristic of the online lending market is that individual loans are smaller than traditional loans while the number of customers is larger. When several borrowers default on their loans, online lending platforms may face cash flow troubles, which become a major threat to their management and operation (Zhang, Wang, Zhang, & Wang, 2020). Unfortunately, due to imperfect information, most online lending platforms perform poorly in credit quality evaluation and loan default risk management. To better manage loan default risk and increase stakeholders’ profits, it is important to assess the default risk of a loan application when platforms provide loans to borrowers.

Generally, loan default risk is defined as the risk of loss due to the borrower's failure to fulfill the loan contract on time (Yuan, Chi, Zhou, & Yin, 2022). Loan default is detrimental to the profits of lenders and to the development of the lending platform or financial intermediary. For this reason, default risk prediction has been widely developed to explore the relationship between the attributes of historical data and the potential default status, so as to identify loan defaulters effectively and provide valuable references for financial institutions, platforms, and lenders (Geng, Bose, & Chen, 2015). However, the two kinds of prediction error have different consequences: wrongly treating a defaulter as a non-defaulter incurs huge economic losses, while classifying a non-defaulter as a defaulter disrupts the pool of high-quality customers (Papouskova & Hajek, 2019). Thus, it is crucial to establish a reliable default prediction model that accurately identifies default customers and maximizes lenders’ profit.
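The asymmetry between the two error types can be made concrete with a minimal Python sketch. It is not the paper's profit measure: the `Loan` fields, the simple-interest payoff, and the assumption that a default loses the whole principal are illustrative choices that do not come from the source, and the sketch assigns no cost to turning away a good customer beyond the forgone interest.

```python
from dataclasses import dataclass


@dataclass
class Loan:
    principal: float      # amount lent
    interest_rate: float  # simple interest over the whole term (assumed)
    defaulted: bool       # true outcome
    approved: bool        # decision derived from the model's prediction


def loan_profit(loan: Loan) -> float:
    """Profit of one lending decision under a deliberately simple payoff model."""
    if not loan.approved:
        return 0.0                      # nothing lent; lost goodwill is not modelled
    if loan.defaulted:
        return -loan.principal          # assumed: the full principal is lost
    return loan.principal * loan.interest_rate


def portfolio_profit(loans) -> float:
    """Total profit over a list of lending decisions."""
    return sum(loan_profit(loan) for loan in loans)


loans = [
    Loan(1000.0, 0.10, defaulted=False, approved=True),   # correct approval
    Loan(1000.0, 0.10, defaulted=True, approved=True),    # missed defaulter
    Loan(1000.0, 0.10, defaulted=False, approved=False),  # good customer turned away
    Loan(1000.0, 0.10, defaulted=True, approved=False),   # correct rejection
]
print(portfolio_profit(loans))
```

Under these assumed payoffs, one missed defaulter cancels the interest earned on ten good loans of the same size, even though the classifier above is right on half of the cases in each class.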

Specifically, evaluating whether a borrower will default is essentially a binary classification problem. In the past few decades, numerous forecasting technologies applicable to binary classification problems have been proposed for loan default forecasting, such as statistical models and machine learning models. Commonly used statistical models in loan default prediction include logistic regression (LR) (Wiginton, 1980) and discriminant analysis (Khemais, Nesrine, & Mohamed, 2016). Statistical models are popular prediction devices because they can provide accurate and easily interpreted prediction results by virtue of simple functional forms. However, the prediction performance of these models is susceptible to changes in the economic environment and credit market, such as an increase in default rates (Moscatelli, Parlapiano, Narizzano, & Viggiano, 2020). Aside from this, statistical models show poor performance in measuring the nonlinear relationship among economic, financial, and credit variables. A growing body of literature indicates that the complex nonlinear interactions between forecasting models and output can be successfully modeled by machine learning models. Machine learning methods forgo evaluating the importance of influence factors for default risk while presenting accurate out-of-sample prediction results (Baesens et al., 2003). Support vector machine (SVM) (Lv, Wang, Niu, & Lu, 2022), random forest (RF), artificial neural networks (ANNs) (Gao et al., 2022, Wang et al., 2022), and gradient boosting decision trees (GBDTs) are available machine learning methods, among which GBDTs have drawn great attention in the default prediction domain. This is because the high predictive performance of most machine learning methods comes at the expense of interpretability and intelligibility, while GBDTs can offer high transparency and intelligibility together with satisfactory prediction performance.

Light gradient boosting machine (LGBM), extreme gradient boosting (XGBT), and categorical boosting (CBT) are several frequently used GBDTs in loan default prediction. Compared to LGBM and XGBT, CBT involves two significant improvements, namely a powerful scheme for categorical features and unbiased estimation of the gradient step, making it outperform both LGBM and XGBT (Prokhorenkova, Gusev, Vorobev, Dorogush, & Gulin, 2018). Moreover, these two improvements enhance the performance of CBT in handling categorical features, shorten the layout time, and avoid overfitting problems. Thus, CBT is a suitable alternative for predicting default risk. However, the prediction performance of CBT is significantly affected by the setting of hyperparameters. Although CBT has been used to predict loan default in some existing literature, the importance of parameter optimization is often ignored, or the parameters are mostly determined by grid search or experience (Qi et al., 2021, Tounsi et al., 2020), which may affect the prediction performance. The emergence of optimization algorithms overcomes this drawback (Niu et al., 2022, Wang et al., 2021). Some studies have used Bayesian optimization (BO) to estimate the hyperparameters of CBT (Xia et al., 2020, Zhou et al., 2021, Xia et al., 2020). However, the objective function of the BO algorithm in these studies is based on an accuracy indicator, ignoring that the core goal of loan default prediction is to reduce financial loss and maximize the lender's profit.
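The source does not spell out the categorical-feature scheme. In the cited CatBoost paper it is described as ordered target statistics: each row's category is encoded with a smoothed target mean computed only from rows that precede it in a random permutation, so a row never sees its own label. The following Python sketch is a simplified, single-permutation illustration of that idea; the function name, the `prior` and `weight` values, and the toy data are assumptions, and the real library differs in many details.

```python
import random


def ordered_target_statistic(categories, targets, prior=0.5, weight=1.0, seed=0):
    """Encode a categorical column using only 'earlier' rows in a random order.

    For each row, the encoding is (sum of earlier targets in the same category
    + weight * prior) / (count of earlier rows in that category + weight).
    """
    order = list(range(len(categories)))
    random.Random(seed).shuffle(order)       # one random permutation (simplification)
    sums, counts = {}, {}
    encoded = [0.0] * len(categories)
    for i in order:
        cat = categories[i]
        seen_sum = sums.get(cat, 0.0)
        seen_count = counts.get(cat, 0)
        # The row's own target is not used for its own encoding.
        encoded[i] = (seen_sum + weight * prior) / (seen_count + weight)
        sums[cat] = seen_sum + targets[i]
        counts[cat] = seen_count + 1
    return encoded


# Hypothetical housing-status column and default labels (1 = default).
categories = ["rent", "own", "rent", "mortgage", "own", "rent"]
targets = [1, 0, 1, 0, 0, 1]
encoded = ordered_target_statistic(categories, targets)
print(encoded)
```

The first row of each category in the permutation receives the prior, and later rows drift toward the observed default rate of the rows before them.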

The above analysis reveals several challenges for current loan default prediction technologies. First, stakeholders prefer making more profit by differentiating good borrowers from bad ones over identifying defaulters more accurately (Ye, Dong, & Ma, 2018), which means that profit-oriented rather than accuracy-oriented loan default prediction strategies should be explored. Second, most previous researchers have focused on profitability evaluation (Garrido et al., 2018, Verbraken et al., 2014), with few studies incorporating profit maximization into the process of constructing prediction models. Further, no study integrates a profit-based measure and an optimization algorithm into the training of CBT. Motivated by these challenges, this study proposes a profit-driven loan default prediction system consisting of a CBT and a BO algorithm (Profit-BOCBT), aiming to provide a new perspective for effective loan default prediction, satisfy the core goal of profit maximization for lenders, help decision makers realize better default risk management, and improve credit market efficiency.

The primary innovation of this study is a new profit-driven loan default prediction model that incorporates a profit-based metric into the guidance of the CBT training process. Specifically, a profit-based measure, namely the average extra profit rate (APR), is set as the learning objective of the BO algorithm to optimize the core hyperparameters of the CBT in the training process. The prediction values are then associated with the sample feature values based on the Shapley additive explanations (SHAP) value to provide interpretable prediction results. The experimental results demonstrate that the proposed model achieves high-quality prediction performance, with mean APR values of 3.0872% and 2.1858% and mean Profit values of 5168.8762 and 352.9787 in datasets 1 and 2, respectively. Thus, the proposed Profit-BOCBT can earn more profits for stakeholders and provide valuable references for lenders and lending institutions.
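The SHAP step mentioned above builds on Shapley values, which split the difference between a prediction and a baseline prediction across the input features. The Python sketch below computes them exactly by enumerating feature coalitions for a tiny hand-written scoring function. Everything in it is an assumption made for illustration: `toy_score`, the feature names, the baseline, and the choice to replace absent features by their baseline value. Practical SHAP implementations for tree models use faster, specialised algorithms rather than this enumeration.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values of predict at x relative to a baseline input.

    Features outside a coalition are set to their baseline value
    (a simplifying assumption of this sketch).
    """
    n = len(x)

    def value(coalition):
        z = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return predict(z)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight for coalitions of this size that exclude feature i.
            w = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi[i] += w * (value(s | {i}) - value(s))
    return phi


def toy_score(z):
    """Hypothetical default-risk score with one interaction term."""
    debt_ratio, late_payments, income = z
    return (0.2 * debt_ratio + 0.1 * late_payments - 0.05 * income
            + 0.3 * debt_ratio * late_payments)


x = [0.8, 2.0, 1.0]          # applicant being explained (made-up values)
baseline = [0.3, 0.0, 1.5]   # reference applicant (made-up values)
phi = shapley_values(toy_score, x, baseline)
print(phi, sum(phi), toy_score(x) - toy_score(baseline))
```

The contributions add up to the gap between the applicant's score and the baseline score, and the interaction between `debt_ratio` and `late_payments` is shared between those two features.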

The remainder of this paper is organized as follows. Section 2 presents the literature review on default prediction methods based on data-driven methods and profit measure. Section 3 introduces the methodologies used in the proposed Profit-BOCBT system. Section 4 shows the experimental setup and corresponding results analysis, and Section 5 concludes the paper.

Section snippets

Literature review

Two themes in the literature, namely data-driven and profit-related approaches to loan default prediction, are reviewed in this section.

Methodology

In this section, we introduce the methodologies used in the proposed loan default prediction system, including CBT, BO algorithm, and prediction performance measures. The detailed flow of the proposed Profit-BOCBT is given in Fig. 1.
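The overall loop that ties these pieces together (a tuner proposes hyperparameters, a model is scored, and a profit measure rather than accuracy decides which setting wins) can be sketched in Python as follows. This is a sketch under stated assumptions, not the authors' implementation: plain random search stands in for the BO algorithm, a two-parameter toy scorer stands in for CBT, and `validation_profit` uses a made-up payoff rule and made-up data rather than the paper's APR.

```python
import random

# Hypothetical validation rows:
# (debt_ratio, late_payment_share, principal, interest_rate, defaulted)
VALIDATION = [
    (0.20, 0.0, 1000.0, 0.10, False),
    (0.35, 0.1, 5000.0, 0.12, False),
    (0.50, 0.2, 2000.0, 0.15, False),
    (0.60, 0.5, 800.0, 0.18, True),
    (0.75, 0.3, 6000.0, 0.20, True),
    (0.90, 0.8, 1500.0, 0.22, True),
]


def validation_profit(params):
    """Profit on VALIDATION for a toy scorer with tunable 'weight' and 'cutoff'."""
    total = 0.0
    for debt, late, principal, rate, defaulted in VALIDATION:
        risk = params["weight"] * debt + (1.0 - params["weight"]) * late
        if risk < params["cutoff"]:  # loan is approved
            # Assumed payoff: interest if repaid, full principal lost on default.
            total += -principal if defaulted else principal * rate
    return total


def profit_driven_search(objective, space, n_trials=200, seed=0):
    """Random search that keeps the setting with the highest objective value.

    A Bayesian optimizer would replace the uniform sampling below with
    proposals guided by a surrogate model of the objective.
    """
    rng = random.Random(seed)
    best_params, best_value = None, float("-inf")
    for _ in range(n_trials):
        params = {name: rng.uniform(lo, hi) for name, (lo, hi) in space.items()}
        value = objective(params)
        if value > best_value:
            best_params, best_value = params, value
    return best_params, best_value


space = {"weight": (0.0, 1.0), "cutoff": (0.0, 1.0)}
best_params, best_profit = profit_driven_search(validation_profit, space)
print(best_params, best_profit)
```

On this toy data the best achievable value is the interest from approving only the three non-defaulting rows. Swapping `validation_profit` for an accuracy function is the only change needed to recover the accuracy-driven tuning that the paper argues against.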

Experimental setup and result analysis

In this section, the data description, experimental setup, simulation results, statistical test, interpretability analysis, computational efficiency, and practical suggestions are explained in detail.

Conclusion

This study proposed a novel perspective for profit-driven loan default prediction: BO is used to optimize the hyperparameters of CBT, with the optimization objective set to a profit indicator (i.e., APR). The prediction performance of the proposed model is compared with that of ten frequently used prediction models on two datasets from Renrendai and Lending Club in two respects: accuracy and profit. Moreover, SHAP values of input variables for the proposed model in Dataset 1 are

CRediT authorship contribution statement

Lifang Zhang: Writing – original draft. Jianzhou Wang: Writing – review & editing. Zhenkun Liu: Software, Validation.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgement

This work was supported by the Major Program of the National Social Science Foundation of China (Grant No. 17ZDA093).

References (68)

P.M. Addo et al. Credit risk analysis using machine and deep learning models. Risks (2018)
M.S. Alam et al. Bayesian optimization algorithm based support vector regression analysis for estimation of shear capacity of FRP reinforced concrete members. Applied Soft Computing (2021)
B. Baesens et al. Benchmarking state-of-the-art classification algorithms for credit scoring. Journal of the Operational Research Society (2003)
Barua, S., Gavandi, D., Sangle, P., Shinde, L., & Ramteke, J. (2021). Swindle: Predicting the Probability of Loan...
L. Breiman. Random forests. Machine Learning (2001)
L. Breiman et al. Classification and Regression Trees (Wadsworth Statistics/Probability) (1984)
Byanjankar, A., Heikkila, M., & Mezei, J. (2015). Predicting credit risk in peer-to-peer lending: A neural network...
S. Chen et al. Modeling default risk with support vector machines. Quantitative Finance (2011)
T. Chen et al. XGBoost: A scalable tree boosting system
B.V. Dasarathy. Nearest Neighbor (NN) Norms: NN Pattern Classification Techniques (1991)
P. Domingos et al. On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning (1997)
R.A. Eisenbeis. Pitfalls in the application of discriminant analysis in business, finance, and economics. The Journal of Finance (1977)
T. Fitzpatrick et al. How can lenders prosper? Comparing machine learning approaches to identify profitable peer-to-peer loan investments. European Journal of Operational Research (2021)
Y. Gao et al. A multi-component hybrid system based on predictability recognition and modified multi-objective optimization for ultra-short-term onshore wind speed forecasting. Renewable Energy (2022)
F. Garrido et al. A Robust profit measure for binary classification model evaluation. Expert Systems with Applications (2018)
R. Geng et al. Prediction of financial distress: An empirical study of listed Chinese companies using data mining. European Journal of Operational Research (2015)
Y. Guo et al. Instance-based credit risk assessment for investment decisions in P2P lending. European Journal of Operational Research (2016)
S. Hamori et al. Ensemble learning or deep learning? Application to default risk analysis. Journal of Risk and Financial Management (2018)
T. Harris. Credit scoring using the clustered support vector machine. Expert Systems with Applications (2015)
H. He et al. A novel hybrid ensemble model based on tree-based method and deep learning method for default prediction. Expert Systems with Applications (2021)
T. He et al. Accelerating multi-layer perceptron based short term demand forecasting using graphics processing units. Transmission and Distribution Conference and Exposition: Asia and Pacific, T and D Asia (2009)
G. Huang et al. Evaluation of CatBoost method for prediction of reference evapotranspiration in humid regions. Journal of Hydrology (2019)
A.A. Ibrahim et al. Comparison of the CatBoost classifier with other machine learning methods. International Journal of Advanced Computer Science and Applications (2020)
G. Ke et al. LightGBM: A highly efficient gradient boosting decision tree (2017)
Z. Khemais et al. Credit scoring and default risk prediction: A comparative study between discriminant analysis & logistic regression. International Journal of Economics and Finance (2016)
N. Kozodoi et al. A multi-objective approach for profit-driven feature selection in credit scoring. Decision Support Systems (2019)
M. Li et al. The network loan risk prediction model based on Convolutional neural network and Stacking fusion model. Applied Soft Computing (2021)
L. Liang et al. Forecasting peer-to-peer platform default rate with LSTM neural network. Electronic Commerce Research and Applications (2020)
W. Liu et al. Credit scoring based on tree-enhanced gradient boosting decision trees. Expert Systems with Applications (2022)
J. López et al. Profit-based credit scoring based on robust optimization and feature selection. Information Sciences (2019)
M. Lv et al. A newly combination model based on data denoising strategy and advanced optimization algorithm for short-term wind speed prediction. Journal of Ambient Intelligence and Humanized Computing (2022)
X. Ma et al. Study on a prediction of P2P network loan default based on the machine learning LightGBM and XGboost algorithms according to different high dimensional data cleaning. Electronic Commerce Research and Applications (2018)
S. Maldonado et al. Integrated framework for profit-based feature selection and SVM classification in credit scoring. Decision Support Systems (2017)
M. Moscatelli et al. Corporate default forecasting with machine learning. Expert Systems with Applications (2020)

