What should lenders be more concerned about? Developing a profit-driven loan default prediction model
Introduction
Online lending is an innovative form of credit that removes the need for financial intermediaries: loans are released directly through the online lending platform (Ouyang, Zhi, & Wu, 2021). It has grown explosively in recent years because of its flexibility, accessibility, and lower financing costs (Guo, Zhou, Luo, Liu, & Xiong, 2016). The main characteristic of the online lending market is that loan orders are smaller than traditional loan orders but come from a larger number of customers. When several borrowers default on their loans, online lending platforms may face cash flow problems, which become a major threat to their management and operation (Zhang, Wang, Zhang, & Wang, 2020). Unfortunately, because of imperfect information, most online lending platforms perform poorly in credit quality evaluation and loan default risk management. To better manage loan default risk and increase stakeholders’ profits, it is important to assess the default risk of a loan application when platforms provide loans to borrowers.
Generally, loan default risk is defined as the risk of loss due to the borrower's failure to fulfill the loan contract on time (Yuan, Chi, Zhou, & Yin, 2022). Loan default is detrimental to the profits of lenders and the development of the lending platform or financial intermediary. To this end, default risk prediction models have been widely developed to explore the relationship between the attributes of historical data and potential default status, so as to identify loan defaulters effectively and provide valuable references for financial institutions, platforms, and lenders (Geng, Bose, & Chen, 2015). However, the two kinds of prediction error carry different costs: a defaulter wrongly taken as a non-defaulter incurs large economic losses, whereas a non-defaulter wrongly classified as a defaulter disrupts the pool of high-quality customers (Papouskova & Hajek, 2019). Thus, it is crucial to establish a reliable default prediction model to accurately identify default customers and maximize lenders’ profit.
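The asymmetry between the two error types can be made concrete with a small sketch. It is illustrative only: the function names, the `interest_rate` and `loss_rate` parameters, the loan amounts, and the convention that a rejected application earns nothing are assumptions made for this example, not definitions taken from this study. The two decision vectors below have identical accuracy but very different profit.

```python
def accuracy(y_true, y_pred):
    """Share of applications whose default status is predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def lending_profit(y_true, y_pred, amounts, interest_rate=0.10, loss_rate=1.00):
    """Profit of funding every application predicted as non-default.

    Labels: 1 = default, 0 = repaid. Assumed (hypothetical) payoffs:
    a funded loan that is repaid earns amount * interest_rate,
    a funded loan that defaults loses amount * loss_rate,
    a rejected application earns nothing.
    """
    profit = 0.0
    for truth, pred, amount in zip(y_true, y_pred, amounts):
        if pred == 1:  # rejected: no cash flow in this toy model
            continue
        if truth == 1:
            profit -= amount * loss_rate
        else:
            profit += amount * interest_rate
    return profit


# Hypothetical portfolio: three good borrowers and one defaulter.
y_true = [0, 0, 0, 1]
amounts = [1000.0, 1000.0, 1000.0, 1000.0]

missed_defaulter = [0, 0, 0, 0]  # funds everyone, including the defaulter
rejected_good = [1, 0, 0, 1]     # catches the defaulter, rejects one good borrower

acc_a = accuracy(y_true, missed_defaulter)
acc_b = accuracy(y_true, rejected_good)
profit_a = lending_profit(y_true, missed_defaulter, amounts)
profit_b = lending_profit(y_true, rejected_good, amounts)

print(f"missed defaulter: accuracy={acc_a:.2f}, profit={profit_a:.2f}")
print(f"rejected good:    accuracy={acc_b:.2f}, profit={profit_b:.2f}")
```

Under these assumed payoffs, both decision vectors make exactly one mistake, yet funding the defaulter wipes out the interest earned on the good loans, which is why an accuracy figure alone says little about the lender's outcome.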
Specifically, evaluating whether a borrower will default is essentially a binary classification problem. In the past few decades, numerous forecasting technologies applicable to binary classification problems have been proposed for loan default forecasting, such as statistical models and machine learning models. Commonly used statistical models in loan default prediction include logistic regression (LR) (Wiginton, 1980) and discriminant analysis (Khemais, Nesrine, & Mohamed, 2016). Statistical models are popular prediction devices because their simple functional forms provide accurate and easily interpreted prediction results. However, the prediction performance of these models is susceptible to changes in the economic environment and credit market, such as an increase in default rates (Moscatelli, Parlapiano, Narizzano, & Viggiano, 2020). Aside from this, statistical models perform poorly in capturing the nonlinear relationships among economic, financial, and credit variables. A growing body of literature indicates that the complex nonlinear interactions between input variables and output can be successfully modeled by machine learning models. Machine learning methods forgo evaluating the importance of the factors that influence default risk, yet deliver accurate out-of-sample prediction results (Baesens et al., 2003). Support vector machine (SVM) (Lv, Wang, Niu, & Lu, 2022), random forest (RF), artificial neural networks (ANNs) (Gao et al., 2022, Wang et al., 2022), and gradient boosting decision trees (GBDTs) are available machine learning methods, among which GBDTs have drawn great attention in the default prediction domain. This is because the high predictive performance of most machine learning methods comes at the expense of interpretability and intelligibility, while GBDTs can fulfill the objective of high transparency and intelligibility as well as provide satisfactory prediction performance.
Light gradient boosting machine (LGBM), extreme gradient boosting (XGBT), and categorical boosting (CBT) are several frequently used GBDTs in loan default prediction. Compared to LGBM and XGBT, CBT involves two significant improvements, namely a powerful scheme for categorical features and unbiased estimation of the gradient step, making it outperform both LGBM and XGBT (Prokhorenkova, Gusev, Vorobev, Dorogush, & Gulin, 2018). Moreover, these two improvements enhance the performance of CBT in handling categorical features, shorten the layout time, and avoid overfitting problems. Thus, CBT is a suitable alternative for predicting default risk. However, the prediction performance of CBT is significantly affected by the setting of hyperparameters. Although CBT has been used to predict loan default in some existing literature, the importance of parameter optimization is often ignored, or the parameters are mostly determined by grid search or experience (Qi et al., 2021, Tounsi et al., 2020), which may affect the prediction performance. The emergence of optimization algorithms overcomes this drawback (Niu et al., 2022, Wang et al., 2021). Some studies have used Bayesian optimization (BO) to estimate the hyperparameters of CBT (Xia et al., 2020, Zhou et al., 2021, Xia et al., 2020). However, in these studies the objective function of the BO algorithm is based on an accuracy indicator, which ignores that the core goal of loan default prediction is to reduce financial loss and maximize the lender's profit.
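To illustrate why the choice of search objective matters, the sketch below tunes one setting twice, once for accuracy and once for profit. It is a toy stand-in and not a CBT or BO implementation: a decision threshold on predicted default probabilities plays the role of the hyperparameter, an exhaustive search over four candidates plays the role of BO, and the scores, loan amounts, and payoff rule are all invented for the example.

```python
def decide(scores, threshold):
    """Predict default (1) when the score reaches the threshold, else 0."""
    return [1 if s >= threshold else 0 for s in scores]


def accuracy(y_true, y_pred):
    """Share of correct predictions."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def profit(y_true, y_pred, amounts, interest_rate=0.10):
    """Hypothetical payoff: funded good loans earn interest,
    funded defaults lose the full amount, rejections earn nothing."""
    total = 0.0
    for truth, pred, amount in zip(y_true, y_pred, amounts):
        if pred == 0:  # loan is funded
            total += -amount if truth == 1 else amount * interest_rate
    return total


def search(objective, candidates):
    """Exhaustive search: return the candidate with the highest objective."""
    return max(candidates, key=objective)


# Invented validation data: predicted default probabilities and true labels.
scores = [0.10, 0.20, 0.35, 0.45, 0.60, 0.90]
y_true = [0, 0, 1, 0, 0, 1]
amounts = [1000.0] * 6
candidates = [0.3, 0.5, 0.7, 1.0]

best_by_accuracy = search(
    lambda th: accuracy(y_true, decide(scores, th)), candidates
)
best_by_profit = search(
    lambda th: profit(y_true, decide(scores, th), amounts), candidates
)

for th in candidates:
    pred = decide(scores, th)
    print(th, round(accuracy(y_true, pred), 3), profit(y_true, pred, amounts))
print("chosen by accuracy:", best_by_accuracy)
print("chosen by profit:  ", best_by_profit)
```

On this invented data the two objectives select different settings, and the accuracy-optimal one loses money. The same search loop is used in both cases; only the objective function passed to it changes.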
According to the above analysis, current loan default prediction technologies face several challenges. First, stakeholders prefer making more profit by differentiating good borrowers from bad ones to identifying defaulters more accurately (Ye, Dong, & Ma, 2018), which means that profit-oriented rather than accuracy-oriented loan default prediction strategies should be explored. Second, most previous researchers have paid more attention to profitability evaluation (Garrido et al., 2018, Verbraken et al., 2014), with few studies incorporating profit maximization into the process of constructing prediction models. Further, no study integrates a profit-based measure and an optimization algorithm into the training process of CBT. Inspired by these challenges, this study proposes a profit-driven loan default prediction system consisting of a CBT and BO algorithm (Profit-BOCBT), aiming to provide a new perspective for effective loan default prediction, satisfy the core goal of profit maximization for lenders, help decision makers realize better default risk management, and improve credit market efficiency.
The primary innovation of this study is a new profit-driven loan default prediction model that incorporates a profit-based metric into the guidance of the CBT training process. Specifically, a profit-based measure, namely the average extra profit rate (APR), is set as the learning objective of the BO algorithm to optimize the core hyperparameters of the CBT in the training process. The prediction values are then associated with the sample feature values based on the Shapley additive explanations (SHAP) value to provide interpretable prediction results. The experimental results demonstrate that the proposed model achieves high-quality prediction performance, with mean APR values of 3.0872 % and 2.1858 % and mean Profit values of 5168.8762 and 352.9787 in Datasets 1 and 2, respectively. Thus, the proposed Profit-BOCBT can earn more profits for stakeholders and provide valuable references for lenders and lending institutions.
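The attribution idea behind SHAP can be sketched with exact Shapley values on a tiny model. This is a from-scratch illustration, not the SHAP library and not the explainer used in this study: the scoring function `toy_score`, its weights, the feature names, and the choice of replacing absent features with a baseline value are all assumptions made for the example.

```python
from itertools import permutations


def shapley_values(predict, x, baseline):
    """Exact Shapley values for one sample.

    Averages each feature's marginal contribution over every ordering in
    which features are switched from their baseline value to their actual
    value. Cost grows factorially, so this only suits a handful of features.
    """
    n = len(x)
    totals = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)
        previous = predict(current)
        for i in order:
            current[i] = x[i]          # reveal feature i
            value = predict(current)
            totals[i] += value - previous
            previous = value
    return [t / len(orderings) for t in totals]


def toy_score(features):
    """Hypothetical default score: linear terms plus one interaction."""
    debt_ratio, late_payments, income = features
    return (
        0.5 * debt_ratio
        + 0.3 * late_payments
        - 0.2 * income
        + 0.1 * debt_ratio * late_payments
    )


x = [0.8, 2.0, 1.5]         # sample to explain (invented values)
baseline = [0.4, 0.0, 1.0]  # reference borrower (invented values)

phi = shapley_values(toy_score, x, baseline)
gap = toy_score(x) - toy_score(baseline)

for name, value in zip(["debt_ratio", "late_payments", "income"], phi):
    print(f"{name}: {value:+.4f}")
print(f"sum of attributions: {sum(phi):.4f}, score gap: {gap:.4f}")
```

The attributions add up to the difference between the sample's score and the baseline score, which is the property that lets each prediction be decomposed into per-feature contributions. The interaction term is shared between the two features involved, while `income`, which enters only linearly, receives exactly its own linear effect.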
The remainder of this paper is organized as follows. Section 2 presents the literature review on default prediction methods based on data-driven methods and profit measure. Section 3 introduces the methodologies used in the proposed Profit-BOCBT system. Section 4 shows the experimental setup and corresponding results analysis, and Section 5 concludes the paper.
Literature review
Two themes in the literature are reviewed in this section: data-driven and profit-related approaches to loan default prediction.
Methodology
In this section, we introduce the methodologies used in the proposed loan default prediction system, including CBT, BO algorithm, and prediction performance measures. The detailed flow of the proposed Profit-BOCBT is given in Fig. 1.
Experimental setup and result analysis
In this section, the data description, experimental setup, simulation results, statistical test, interpretability analysis, computational efficiency, and practical suggestions are explained in detail.
Conclusion
This study proposed a novel perspective for profit-driven loan default prediction: BO is used to optimize the hyperparameters of CBT, with the optimization objective set to a profit indicator (i.e., APR). The prediction performance of the proposed model is compared with ten frequently used prediction models on two datasets, from Renrendai and Lending Club, in two aspects: accuracy and profit. Moreover, SHAP values of input variables for the proposed model in Dataset 1 are
CRediT authorship contribution statement
Lifang Zhang: Writing – original draft. Jianzhou Wang: Writing – review & editing. Zhenkun Liu: Software, Validation.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgement
This work was supported by the Major Program of the National Social Science Foundation of China (Grant No. 17ZDA093).
References (68)
- et al. Credit risk analysis using machine and deep learning models. Risks (2018)
- et al. Bayesian optimization algorithm based support vector regression analysis for estimation of shear capacity of FRP reinforced concrete members. Applied Soft Computing (2021)
- et al. Benchmarking state-of-the-art classification algorithms for credit scoring. Journal of the Operational Research Society (2003)
- Barua, S., Gavandi, D., Sangle, P., Shinde, L., & Ramteke, J. (2021). Swindle: Predicting the Probability of Loan...
- Random forests. Machine Learning (2001)
- et al. Classification and Regression Trees (Wadsworth Statistics/Probability) (1984)
- Byanjankar, A., Heikkila, M., & Mezei, J. (2015). Predicting credit risk in peer-to-peer lending: A neural network...
- et al. Modeling default risk with support vector machines. Quantitative Finance (2011)
- et al. XGBoost: A scalable tree boosting system
- Nearest Neighbor (NN) Norms: NN Pattern Classification Techniques (1991)
- On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning
- Pitfalls in the application of discriminant analysis in business, finance, and economics. The Journal of Finance
- How can lenders prosper? Comparing machine learning approaches to identify profitable peer-to-peer loan investments. European Journal of Operational Research
- A multi-component hybrid system based on predictability recognition and modified multi-objective optimization for ultra-short-term onshore wind speed forecasting. Renewable Energy
- A Robust profit measure for binary classification model evaluation. Expert Systems with Applications
- Prediction of financial distress: An empirical study of listed Chinese companies using data mining. European Journal of Operational Research
- Instance-based credit risk assessment for investment decisions in P2P lending. European Journal of Operational Research
- Ensemble learning or deep learning? Application to default risk analysis. Journal of Risk and Financial Management
- Credit scoring using the clustered support vector machine. Expert Systems with Applications
- A novel hybrid ensemble model based on tree-based method and deep learning method for default prediction. Expert Systems with Applications
- Accelerating multi-layer perceptron based short term demand forecasting using graphics processing units. Transmission and Distribution Conference and Exposition: Asia and Pacific, T and D Asia
- Evaluation of CatBoost method for prediction of reference evapotranspiration in humid regions. Journal of Hydrology
- Comparison of the CatBoost classifier with other machine learning methods. International Journal of Advanced Computer Science and Applications
- LightGBM: A highly efficient gradient boosting decision tree
- Credit scoring and default risk prediction: A comparative study between discriminant analysis & logistic regression. International Journal of Economics and Finance
- A multi-objective approach for profit-driven feature selection in credit scoring. Decision Support Systems
- The network loan risk prediction model based on Convolutional neural network and Stacking fusion model. Applied Soft Computing
- Forecasting peer-to-peer platform default rate with LSTM neural network. Electronic Commerce Research and Applications
- Credit scoring based on tree-enhanced gradient boosting decision trees. Expert Systems with Applications
- Profit-based credit scoring based on robust optimization and feature selection. Information Sciences
- A newly combination model based on data denoising strategy and advanced optimization algorithm for short-term wind speed prediction. Journal of Ambient Intelligence and Humanized Computing
- Study on a prediction of P2P network loan default based on the machine learning LightGBM and XGboost algorithms according to different high dimensional data cleaning. Electronic Commerce Research and Applications
- Integrated framework for profit-based feature selection and SVM classification in credit scoring. Decision Support Systems
- Corporate default forecasting with machine learning. Expert Systems with Applications