Approaches for credit scorecard calibration: An empirical analysis
Introduction
Credit scoring helps to improve the efficiency of loan officers, reduce human bias in lending decisions, quantify expected losses, and, more generally, manage financial risks effectively and responsibly [13], [21]. Today, almost all lenders rely upon scoring systems to assess financial risks [44]. In retail lending, for example, credit scoring is widely used to decide on applications for personal credit cards, consumer loans, and mortgages [26]. A lender employs data from past transactions to predict an applicant's chance of default. To decide on the application, the lender then compares the predicted probability of default (PD) to a cut-off value, granting credit if the prediction is below the cut-off and rejecting the application otherwise [35].
Many techniques for scorecard development have been proposed and studied. Examples include artificial neural networks [2], [51], [55], support vector machines [16], [36], multiple classifier systems [27], hybrid models [4], [18], and genetic programming [1]. In general, any classification algorithm facilitates the construction of a scorecard and PD modeling in particular [15]. Logistic regression is the most widely used approach in industry [44], although other, more sophisticated classification algorithms have been shown to predict credit risks more accurately [6], [45]. A comprehensive review of 214 articles, books, and theses on application credit scoring [3] further supports the view that more advanced techniques (e.g., genetic algorithms) outperform conventional models (e.g., logistic regression). However, the same authors also report on studies that find similar predictive accuracy across techniques [3].
In addition to predictive accuracy, the suitability of a scorecard also depends on other dimensions such as comprehensibility and compliance [34], or the identification of key variables for classifiers by mitigating noisy data and redundant attributes [69]. This paper, however, concentrates on one specific dimension of scorecard performance: calibration.
A well-calibrated scorecard is one that produces probabilistic forecasts corresponding with observed probabilities [20]. For example, consider one hundred loans for which a scorecard estimates a PD of ten percent. If the scorecard is well-calibrated, the number of these loans that eventually default should be close to ten.
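This band-level comparison of predicted and observed default rates can be sketched in a few lines. The sketch below uses invented data, and the function name `reliability_table` and the ten-band split are our own illustrative choices, not part of the paper's evaluation protocol:

```python
import numpy as np

def reliability_table(pd_pred, defaulted, n_bins=10):
    """Group loans into bands of predicted PD and compare the mean
    prediction in each band with the observed default rate."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.digitize(pd_pred, bins[1:-1])  # band index per loan
    rows = []
    for b in range(n_bins + 1):
        mask = idx == b
        if mask.any():
            rows.append((pd_pred[mask].mean(),
                         defaulted[mask].mean(),
                         int(mask.sum())))
    return rows

# One hundred loans all predicted at a 10% PD; under good calibration,
# roughly ten of them should default.
rng = np.random.default_rng(0)
pd_pred = np.full(100, 0.10)
defaulted = (rng.random(100) < 0.10).astype(int)
for mean_pred, obs_rate, n in reliability_table(pd_pred, defaulted):
    print(f"predicted {mean_pred:.2f}  observed {obs_rate:.2f}  n={n}")
```

A large gap between the predicted and observed columns in any band signals miscalibration in that score range.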
Scorecard calibration is important for many reasons. Regulatory frameworks such as the Basel Accord require financial institutions to verify that their internal rating systems produce calibrated risk predictions. Poor calibration, therefore, is penalized with higher regulatory capital requirements [22]. Calibration is also relevant from a lending decision-making point of view [19]. At a micro-level, well-calibrated risk predictions are essential to evaluate credit applications in economic terms (e.g., by calculating expected gains and losses), which is more relevant to the business than an evaluation in terms of statistical accuracy measures alone [12], [33]. At a macro-level, calibration is important for portfolio risk management and default rate estimation [64]. In particular, to forecast the default rate of a credit portfolio, one may adopt a classify-and-count strategy [10]. This approach derives the portfolio default rate forecast from individual-level (single loan) risk predictions and thus benefits from calibration [65]. Furthermore, approaches to support managerial decisions have to account for the cognitive abilities and limitations of decision makers [50]. Although far from perfect, probabilities (rather than, say, log-odds) are a format for representing information that decision makers understand and process relatively well [43]. Thus, a credit analyst is likely to distill more information from a well-calibrated PD estimate. Last, scorecards are developed from loans granted in the past and used to forecast the risk of lending to novel applicants [37]. Due to changes in customer behavior, economic conditions, and so on, default rates may differ across the corresponding distributions. Calibration is a way to account for these differences in prior probabilities [20].
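The classify-and-count idea, and the calibrated alternative of averaging individual PDs, can be illustrated with a minimal sketch; the PD values and the 0.5 cut-off below are our own assumptions for illustration, not taken from the cited work:

```python
import numpy as np

# Hypothetical PD estimates for a small portfolio of six loans.
pd_pred = np.array([0.02, 0.05, 0.08, 0.15, 0.30, 0.60])

# Classify-and-count: classify each loan against a cut-off, then
# report the share of predicted defaulters as the portfolio rate.
cc_forecast = float(np.mean(pd_pred >= 0.5))

# Probability average: the mean of individual PD estimates, which is
# a sensible portfolio default rate only if the PDs are calibrated.
pa_forecast = float(np.mean(pd_pred))

print(cc_forecast, pa_forecast)
```

The two forecasts can differ substantially; the probability average exploits the full probabilistic information in the predictions and therefore benefits directly from calibration.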
The Institute of International Finance, Inc. and the International Swaps and Derivatives Association have called for a higher recognition of calibration when choosing among scorecards [40]. However, we find numerous studies that concentrate on, for example, balancing accuracy against complexity [76], improving existing classifiers [69], or proposing new multiple classifier systems [6], but rarely studies devoted to calibration. The authors of [3] conclude that the receiver operating characteristic curve and the Gini coefficient are the most popular performance evaluation criteria in credit scoring. We therefore argue that the relevance of calibration is still not sufficiently reflected in the credit scoring literature. To further support this point, we consider a recent review of more than forty empirical credit scoring studies published between 2003 and 2014 [49]. Among the articles reviewed in [49], we find only one study [46] that explicitly raises the issue of calibration and uses suitable evaluation metrics such as the Brier Score. More recent literature published after 2014 shows the same pattern. We find only two studies that use the Brier Score to measure classifier performance [5], [6]. However, both studies concentrate on developing novel classification systems, which are assessed in terms of the Brier Score, among other measures. Neither [5] nor [6] considers techniques to improve calibration, which supports the view that calibration methods have not been examined sufficiently in credit scoring; or, in the words of Van Hoorde et al. [68]: calibration is often overlooked in risk modeling.
There is ample evidence that advanced learning algorithms in particular, such as random forests, which enjoy much popularity in credit scoring, produce poorly calibrated predictions [47], [54], [60], [74]. This suggests a trade-off between predictive accuracy and calibration. Calibration assumes that the relationship between the raw score that a classification model produces and the true PD is monotonic. Therefore, calibration consists of estimating a monotonic function to map raw scores to (calibrated) PDs. Given that the calibration function is monotonic, it maintains the ordering of cases by raw score and consequently has no effect on the discriminative power of classifiers [71]. Examples of calibration techniques include isotonic regression and Platt scaling [54]. They promise to overcome the accuracy-calibration trade-off and seem to have potential for credit scoring. To the best of our knowledge, no prior work in credit scoring has systematically explored this potential.
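Both techniques can be sketched with scikit-learn. The scores and outcomes below are invented for illustration, and fitting a plain `LogisticRegression` on raw scores is a simplified stand-in for Platt scaling (Platt's original procedure additionally smooths the targets):

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression
from sklearn.linear_model import LogisticRegression

# Raw classifier scores on a held-out calibration sample and the
# observed outcomes (1 = default). Values are invented.
scores = np.array([-2.1, -1.3, -0.4, 0.2, 0.9, 1.5, 2.2, 3.0])
y = np.array([0, 0, 0, 0, 1, 0, 1, 1])

# Platt scaling: fit a sigmoid (logistic regression) on the raw scores.
platt = LogisticRegression().fit(scores.reshape(-1, 1), y)
pd_platt = platt.predict_proba(scores.reshape(-1, 1))[:, 1]

# Isotonic regression: fit a monotonic, piecewise-constant mapping.
iso = IsotonicRegression(out_of_bounds="clip").fit(scores, y)
pd_iso = iso.predict(scores)
```

Because both mappings are monotonic in the raw score, the ranking of applicants, and hence the scorecard's discriminative power (e.g., its AUC), is unchanged; only the probability scale is adjusted.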
The goal of this paper is to close this research gap. More specifically, we aim at examining the degree to which alternative algorithms for scorecard development suffer from poor calibration, evaluating techniques for improving calibration, and, thereby, contributing towards increasing the fit of advanced classifiers for real-world banking requirements. In pursuing these objectives, we make the following contributions. First, we establish the difference between accuracy and calibration measures. This helps to understand the conceptual differences between the two and to emphasize the need to address calibration in scorecard development. Second, we introduce several methods to improve calibration, subsequently called calibrators, to the credit scoring community and systematically assess their performance through empirical experimentation. Third, we examine the interaction between classifiers and calibrators. This allows us to identify synergies between the modeling approaches and to provide specific recommendations on which techniques work well together. Last, relying upon reliability analysis and a decomposition of the Brier Score, we shed light on the determinants of calibrator effectiveness and provide insight into why and when calibrators work well.
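The decomposition referred to here is the classical Murphy decomposition of the Brier Score into reliability, resolution, and uncertainty. A minimal sketch follows; the ten equal-width bands are our own illustrative choice, and the identity is exact only when forecasts are constant within each band:

```python
import numpy as np

def brier_decomposition(pd_pred, y, n_bins=10):
    """Murphy decomposition: Brier = reliability - resolution + uncertainty.
    Reliability (lower is better) captures calibration; resolution
    (higher is better) captures how much forecasts separate outcomes."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.digitize(pd_pred, bins[1:-1])
    base = y.mean()                     # overall default rate
    rel = res = 0.0
    n = len(y)
    for b in range(n_bins + 1):
        mask = idx == b
        if mask.any():
            f = pd_pred[mask].mean()    # mean forecast in band
            o = y[mask].mean()          # observed frequency in band
            w = mask.sum() / n
            rel += w * (f - o) ** 2
            res += w * (o - base) ** 2
    unc = base * (1 - base)
    return rel, res, unc
```

A calibrator targets the reliability term: it can shrink the gap between forecasts and observed frequencies without touching resolution, since the monotonic mapping leaves the ordering of cases intact.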
The remainder of the paper is organized as follows: Section 2 introduces relevant methodology and the calibrators in particular. Section 3 describes the experimental design before empirical results are presented in Section 4. Section 5 concludes the paper.
Section snippets
Calibration methods
A classifier or a scorecard estimates a functional relationship between the probability distribution of a binary class label - good or bad risk - and a set of explanatory variables, which profile the applicant's characteristics and behavior. For example, bad risks are commonly defined as customers who miss three consecutive payments [66]. Calibration serves two purposes. First, some classification algorithms are unable to produce probabilistic predictions. For instance, support vector machines
Experimental setup
We examine the relative effectiveness of the calibrators for credit scoring through empirical experimentation. Our experimental design draws inspiration from Lessmann et al. [49]. Their study compares 41 classification algorithms across eight retail credit scoring data sets using different indicators of classification performance. The data sets are well known in credit scoring and have – at least partially – been used in several prior studies, e.g., [9], [11], [15], [32], [47], [67], [72]. We
Empirical results
The experimental results consist of performance estimates for every combination of the factors: classifier (5 levels), calibrator (7 levels), credit scoring data set (8 levels), and performance measure (3 levels). The performance measures capture the degree to which classifiers discriminate between good and bad credit risks accurately and are well calibrated, respectively.
Conclusion
We set out to examine how different calibration methods contribute to the improvement of the probability forecast quality of classification algorithms in credit scoring. Calibration can be seen as one dimension of scorecard quality and is a regulatory requirement. Given that only a few credit scoring studies raise the issue of calibration and that the assessment of scorecards in this literature is predominantly based on indicators that do not capture calibration, our study aimed at filling this
References (76)
Genetic programming for credit scoring: the case of Egyptian public sector banks. Expert Syst. Appl. (2009)
Neural nets versus conventional techniques in credit scoring in Egyptian banking. Expert Syst. Appl. (2008)
Predicting creditworthiness in retail banking with limited scoring data. Knowl.-Based Syst. (2016)
Classifiers consensus system approach for credit scoring. Knowl.-Based Syst. (2016)
A new hybrid ensemble credit scoring model based on classifiers consensus system approach. Expert Syst. Appl. (2016)
Corporate distress diagnosis: comparisons using linear discriminant analysis and neural networks (the Italian experience). J. Bank. Finance (1994)
Quantification-oriented learning based on reliable classifiers. Pattern Recognit. (2015)
Accuracy of machine learning models versus "hand crafted" expert systems – a credit scoring case study. Expert Syst. Appl. (2009)
Economic benefit of powerful credit scoring. J. Bank. Finance (2006)
An experimental comparison of classification algorithms for imbalanced credit scoring data sets. Expert Syst. Appl. (2012)
Clustering categories in support vector machines. Omega
Hybrid models based on rough set classifiers for setting credit rating decision rules in the global banking industry. Knowl.-Based Syst.
A probability-mapping algorithm for calibrating the posterior probabilities: a direct marketing application. Eur. J. Oper. Res.
Recent developments in consumer credit risk assessment. Eur. J. Oper. Res.
A comparative analysis of current credit risk models. J. Bank. Finance
Credit scorecard based on logistic regression with random coefficients. Procedia Comput. Sci.
Multiple classifier architectures and their application to credit risk assessment. Eur. J. Oper. Res.
An empirical comparison of classification algorithms for mortgage default prediction: evidence from a distressed mortgage market. Eur. J. Oper. Res.
Enhancing accuracy and interpretability of ensemble strategies in credit risk assessment: a correlated-adjusted decision forest proposal. Expert Syst. Appl.
Calibration of subjective probability judgments in a naturalistic setting. Organ. Behav. Human Decis. Process.
Consumer credit risk: individual probability estimates using machine learning. Expert Syst. Appl.
Benchmarking state-of-the-art classification algorithms for credit scoring: an update of research. Eur. J. Oper. Res.
Evaluating consumer loans using neural networks. Omega
Subagging for credit scoring models. Eur. J. Oper. Res.
A survey of credit and behavioural scoring: forecasting financial risk of lending to consumers. Int. J. Forecasting
A spline-based tool to assess and visualize the calibration of multiclass risk predictions. J. Biomed. Inform.
Two credit scoring models based on dual strategy ensemble trees. Knowl.-Based Syst.
Using data mining to improve assessment of credit worthiness via credit scoring models. Expert Syst. Appl.
Balancing accuracy, complexity and interpretability in consumer credit decision making: a C-TOPSIS classification approach. Knowl.-Based Syst.
Credit scoring, statistical techniques and evaluation criteria: a review of the literature. Intell. Syst. Account. Finance Manag.
An empirical distribution function for sampling with incomplete information. Ann. Math. Stat.
Benchmarking state-of-the-art classification algorithms for credit scoring. J. Oper. Res. Soc.
Example-dependent cost-sensitive logistic regression for credit scoring
A new goodness-of-fit test for event forecasting and its application to credit defaults. Manag. Sci.
Getting the most out of ensemble selection
Incentivizing calculated risk-taking: evidence from an experiment with commercial bank loan officers. J. Finance
Approximate statistical tests for comparing supervised classification learning. Neural Comput.
Credit scoring model based on back propagation neural network using various activation and error function. Int. J. Comput. Sci. Netw. Secur.