Knowledge-Based Systems

Volume 134, 15 October 2017, Pages 213-227

Approaches for credit scorecard calibration: An empirical analysis

https://doi.org/10.1016/j.knosys.2017.07.034

Abstract

Financial institutions use credit scorecards for risk management. A scorecard is a data-driven model for predicting default probabilities. Scorecard assessment concentrates on how well a scorecard discriminates good and bad risk. Whether predicted and observed default probabilities agree (i.e., calibration) is an equally important yet often overlooked dimension of scorecard performance. Surprisingly, no attempt has been made to systematically explore different calibration methods and their implications in credit scoring. The goal of the paper is to integrate previous work on probability calibration, to re-introduce available calibration techniques to the credit scoring community, and to empirically examine the extent to which they improve scorecards. More specifically, using real-world credit scoring data, we first develop scorecards using different classifiers, next apply calibration methods to the classifier predictions, and then measure the degree to which they improve calibration. To evaluate performance, we measure the accuracy of predictions in terms of the Brier Score before and after calibration, and employ repeated measures analysis of variance to test for significant differences between group means. Furthermore, we check calibration using reliability plots and decompose the Brier Score to clarify the origin of performance differences across calibrators. The observed results suggest that post-processing scorecard predictions using a calibrator is beneficial. Calibrators improve scorecard calibration while the discriminatory ability remains unaffected. Generalized additive models are particularly suitable for calibrating classifier predictions.

Introduction

Credit scoring helps to improve the efficiency of loan officers, reduce human bias in lending decisions, quantify expected losses, and, more generally, manage financial risks effectively and responsibly [13], [21]. Today, almost all lenders rely upon scoring systems to assess financial risks [44]. In retail lending, for example, credit scoring is widely used to decide on applications for personal credit cards, consumer loans, and mortgages [26]. A lender employs data from past transactions to predict the chance that an applicant will default. To decide on the application, the lender then compares the predicted probability of default (PD) to a cut-off value, granting credit if the prediction is below the cut-off and rejecting the application otherwise [35].

Many techniques for scorecard development have been proposed and studied. Examples include artificial neural networks [2], [51], [55], support vector machines [16], [36], multiple classifier systems [27], hybrid models [4], [18], and genetic programming [1]. In general, any classification algorithm facilitates the construction of a scorecard and PD modeling in particular [15]. Logistic regression is the most widely used approach in industry [44], although other, more sophisticated classification algorithms have been shown to predict credit risks more accurately [6], [45]. A comprehensive review of 214 articles, books, and theses on application credit scoring [3] further supports the view that more advanced techniques (e.g., genetic algorithms) outperform conventional models (e.g., logistic regression). However, the authors also report on studies that find similar performance in terms of predictive accuracy [3].

In addition to predictive accuracy, the suitability of a scorecard also depends on other dimensions such as comprehensibility and compliance [34] or the selection of key variables for classifiers by mitigating noisy data and redundant attributes [69]. This paper, however, concentrates on one specific dimension of scorecard performance: calibration.

A well-calibrated scorecard is one which produces probabilistic forecasts that correspond with observed probabilities [20]. For example, consider one hundred loans to which some scorecard assigns a predicted PD of ten percent. If the scorecard is well-calibrated, the actual number of eventually defaulting loans in this band should be close to ten.
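The hundred-loans example can be expressed as a minimal code sketch: within a band of predicted PDs, compare the mean prediction to the observed default rate. The function name, band limits, and data below are illustrative, not from the paper.

```python
def band_calibration(preds, outcomes, lo, hi):
    """Mean predicted PD vs. observed default rate for loans whose
    predicted PD falls in the band [lo, hi)."""
    in_band = [(p, y) for p, y in zip(preds, outcomes) if lo <= p < hi]
    if not in_band:
        return None
    mean_pred = sum(p for p, _ in in_band) / len(in_band)
    obs_rate = sum(y for _, y in in_band) / len(in_band)
    return mean_pred, obs_rate, len(in_band)

# One hundred loans predicted at 10% PD, ten of which default:
# for a well-calibrated scorecard the two rates roughly agree.
preds = [0.10] * 100
outcomes = [1] * 10 + [0] * 90
print(band_calibration(preds, outcomes, 0.05, 0.15))
```

Repeating this comparison over several bands is exactly what a reliability plot, used later in the paper, visualizes.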

Scorecard calibration is important for many reasons. Regulatory frameworks such as the Basel Accord require financial institutions to verify that their internal rating systems produce calibrated risk predictions. Poor calibration, therefore, is penalized with higher regulatory capital requirements [22]. Calibration is also relevant from a lending decision making point of view [19]. At a micro-level, well-calibrated risk predictions are essential to evaluate credit applications in economic terms (e.g., through calculating expected gains/losses), which is more relevant to the business than an evaluation in terms of statistical accuracy measures only [12], [33]. At a macro-level, calibration is important for portfolio risk management and default rate estimation [64]. In particular, to forecast the default rate of a credit portfolio, one may adopt a classify-and-count strategy [10]. This approach derives the portfolio default rate forecast from individual-level (single loan) risk predictions and thus benefits from calibration [65]. Furthermore, approaches to support managerial decisions have to account for the cognitive abilities and limitations of decision makers [50]. Although far from perfect, probabilities (rather than, say, log-odds) are a format for representing information that decision makers understand and process relatively well [43]. Thus, a credit analyst is likely to distil more information from a well-calibrated PD estimate. Last, scorecards are developed from loans granted in the past and used to forecast the risk of lending to novel applicants [37]. Due to changes in customer behavior, economic conditions, etc., default rates may differ across the corresponding distributions. Calibration is a way to account for the differences in prior probabilities [20].
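The portfolio-level point can be illustrated with a minimal sketch, under our own naming and with synthetic data: a classify-and-count forecast tallies loans whose predicted PD exceeds a cut-off, while an alternative averages the (ideally calibrated) individual PDs, which is why calibration matters at this level.

```python
def classify_and_count(preds, cutoff=0.5):
    """Portfolio default-rate forecast as the share of loans whose
    predicted PD lies at or above the cut-off."""
    return sum(p >= cutoff for p in preds) / len(preds)

def probability_average(preds):
    """Portfolio default-rate forecast as the mean of individual PDs;
    meaningful only if the PDs are well calibrated."""
    return sum(preds) / len(preds)

# Synthetic predicted PDs for a six-loan portfolio.
preds = [0.02, 0.05, 0.10, 0.30, 0.60, 0.80]
print(classify_and_count(preds))   # share of loans above the cut-off
print(probability_average(preds))  # expected default rate from PDs
```

With poorly calibrated PDs the two forecasts can diverge sharply; post-processing the scores with a calibrator brings the probability average back in line with the realized default rate.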

The Institute of International Finance, Inc. and the International Swaps and Derivatives Association have called for a higher recognition of calibration when choosing among scorecards [40]. However, we find multiple studies that concentrate on, for example, balancing accuracy and complexity [76], improving existing classifiers [69], or offering new multiple classifier systems [6], but rarely studies devoted to calibration. In [3], the authors conclude that the receiver operating characteristic curve and the Gini coefficient are the most popular performance evaluation criteria in credit scoring. That is why we argue that the relevance of calibration is still not sufficiently reflected in the credit scoring literature. To further support this point, we consider a recent review of more than forty empirical credit scoring studies published between 2003 and 2014 [49]. Among the articles reviewed in [49], we find only one study [46] that explicitly raises the issue of calibration and uses suitable evaluation metrics such as the Brier Score. More recent literature published after 2014 shows the same pattern. We find only two studies that use the Brier Score to measure classifier performance [5], [6]. However, both studies concentrate on developing novel classification systems, which are assessed in terms of the Brier Score, amongst others. Neither [5] nor [6] considers techniques to improve calibration, which supports the view that calibration methods have not been examined sufficiently in credit scoring; or, in the words of Van Hoorde et al. [68]: calibration is often overlooked in risk modeling.
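For reference, the Brier Score mentioned above is simply the mean squared difference between predicted PDs and the realized 0/1 outcomes (lower is better), which makes it sensitive to calibration in a way that rank-based criteria such as the Gini coefficient are not. A minimal sketch on synthetic data:

```python
def brier_score(preds, outcomes):
    """Mean squared difference between predicted PDs and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(preds, outcomes)) / len(preds)

# A constant forecast of 0.2 on a sample whose default rate is 20%:
# well calibrated, though it does not discriminate at all.
preds = [0.2] * 10
outcomes = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
print(brier_score(preds, outcomes))
```

The decomposition of this score used later in the paper separates such calibration effects from discrimination (resolution) effects.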

There is ample evidence that especially advanced learning algorithms such as random forest, which enjoy much popularity in credit scoring, produce predictions that are poorly calibrated [47], [54], [60], [74]. This suggests a trade-off between predictive accuracy and calibration. Calibration assumes that the relationship between the raw score, which a classification model produces, and the true PD is monotonic. Therefore, calibration consists of estimating a monotonic function to map raw scores to (calibrated) PDs. Given that the calibration function is monotonic, it maintains the ordering of the cases by raw score and consequently has no effect on the discriminative power of classifiers [71]. Examples of calibration techniques include isotonic regression or Platt scaling [54]. They promise to overcome the accuracy-calibration-trade-off and seem to have potential for credit scoring. To the best of our knowledge, no attempt has been made to systematically explore this potential in prior work in credit scoring.
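Platt scaling, one of the calibrators just named, can be sketched as fitting a logistic curve sigma(a*s + b) that maps raw scores s to calibrated PDs. The plain gradient-descent fit below is a simplified stand-in for Platt's original Newton-style optimization, with hypothetical function names and synthetic data; since the fitted curve is monotonic in s whenever a > 0, the ranking of cases, and hence discriminatory power, is preserved.

```python
import math

def platt_fit(scores, labels, lr=0.1, iters=5000):
    """Fit a and b in sigma(a*s + b) by gradient descent on the log loss."""
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(iters):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            p = 1.0 / (1.0 + math.exp(-(a * s + b)))
            grad_a += (p - y) * s / n   # d(log loss)/da
            grad_b += (p - y) / n       # d(log loss)/db
        a -= lr * grad_a
        b -= lr * grad_b
    return a, b

def platt_predict(score, a, b):
    """Map a raw classifier score to a calibrated PD."""
    return 1.0 / (1.0 + math.exp(-(a * score + b)))

# Raw scores from some classifier (higher = riskier) with noisy labels.
scores = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0]
labels = [0, 0, 0, 0, 1, 0, 1, 1, 1]
a, b = platt_fit(scores, labels)
print(platt_predict(-2.0, a, b), platt_predict(2.0, a, b))
```

Isotonic regression replaces the parametric logistic curve with a monotonic step function fitted by pool-adjacent-violators, trading flexibility for a greater risk of overfitting on small samples.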

The goal of this paper is to close this research gap. More specifically, we aim at examining the degree to which alternative algorithms for scorecard development suffer from poor calibration, evaluating techniques for improving calibration, and, thereby, contributing towards increasing the fit of advanced classifiers for real-world banking requirements. In pursuing these objectives, we make the following contributions. First, we establish the difference between accuracy and calibration measures. This helps to understand the conceptual differences between the two and to emphasize the need to address calibration in scorecard development. Second, we introduce several methods to improve calibration, subsequently called calibrators, to the credit scoring community and systematically assess their performance through empirical experimentation. Third, we examine the interaction between classifiers and calibrators. This allows us to identify synergies between the modeling approaches and to provide specific recommendations which techniques work well together. Last, relying upon reliability analysis and a decomposition of the Brier Score, we shed light on the determinants of calibrator effectiveness and provide insight into why and when calibrators work well.

The remainder of the paper is organized as follows: Section 2 introduces relevant methodology and the calibrators in particular. Section 3 describes the experimental design before empirical results are presented in Section 4. Section 5 concludes the paper.

Section snippets

Calibration methods

A classifier or a scorecard estimates a functional relationship between the probability distribution of a binary class label - good or bad risk - and a set of explanatory variables, which profile the applicant's characteristics and behavior. For example, bad risks are commonly defined as customers who miss three consecutive payments [66]. Calibration serves two purposes. First, some classification algorithms are unable to produce probabilistic predictions. For instance, support vector machines …

Experimental setup

We examine the relative effectiveness of the calibrators for credit scoring through empirical experimentation. Our experimental design draws inspiration from Lessmann et al. [49]. Their study compares 41 classification algorithms across eight retail credit scoring data sets using different indicators of classification performance. The data sets are well known in credit scoring and have – at least partially – been used in several prior studies, e.g., [9], [11], [15], [32], [47], [67], [72]. We …

Empirical results

The experimental results consist of the performance estimates for every combination of the factors: classifier (5 levels), calibrator (7 levels), credit scoring data set (8 levels), and performance measure (3 levels). The performance measures capture the degree to which classifiers discriminate between good and bad credit risks and the degree to which they are well calibrated, respectively.

Conclusion

We set out to examine how different calibration methods contribute to the improvement of the probability forecast quality of classification algorithms in credit scoring. Calibration can be seen as one dimension of scorecard quality and is a regulatory requirement. Given that only a few credit scoring studies raise the issue of calibration and that the assessment of scorecards in this literature is predominantly based on indicators that do not capture calibration, our study aimed at filling this …

References (76)

  • E. Carrizosa et al.

    Clustering categories in support vector machines

    Omega

    (2017)
  • Y.-S. Chen et al.

    Hybrid models based on rough set classifiers for setting credit rating decision rules in the global banking industry

    Knowl.-Based Syst.

    (2013)
  • K. Coussement et al.

    A probability-mapping algorithm for calibrating the posterior probabilities: A direct marketing application

    Eur. J. Oper. Res.

    (2011)
  • J.N. Crook et al.

    Recent developments in consumer credit risk assessment

    Eur. J. Oper. Res.

    (2007)
  • M. Crouhy et al.

    A comparative analysis of current credit risk models

    J. Bank. Finance

    (2000)
  • G. Dong et al.

    Credit scorecard based on logistic regression with random coefficients

    Procedia Comput. Sci.

    (2010)
  • S. Finlay

    Multiple classifier architectures and their application to credit risk assessment

    Eur. J. Oper. Res.

    (2011)
  • T. Fitzpatrick et al.

    An empirical comparison of classification algorithms for mortgage default prediction: evidence from a distressed mortgage market

    Eur. J. Oper. Res.

    (2016)
  • R. Florez-Lopez et al.

    Enhancing accuracy and interpretability of ensemble strategies in credit risk assessment. A correlated-adjusted decision forest proposal

    Expert Syst. Appl.

    (2015)
  • J.E.V. Johnson et al.

    Calibration of subjective probability judgments in a naturalistic setting

    J. Organ. Behav. Human Decis. Process.

    (2001)
  • J. Kruppa et al.

    Consumer credit risk: individual probability estimates using machine learning

    Expert Syst. Appl.

    (2013)
  • S. Lessmann et al.

    Benchmarking state-of-the-art classification algorithms for credit scoring: an update of research

    Eur. J. Oper. Res.

    (2015)
  • R. Malhotra et al.

    Evaluating consumer loans using neural networks

    Omega

    (2003)
  • G. Paleologo et al.

    Subagging for credit scoring models

    Eur. J. Oper. Res.

    (2010)
  • L.C. Thomas

    A survey of credit and behavioural scoring: forecasting financial risk of lending to consumers

    Int. J. Forecasting

    (2000)
  • K. Van Hoorde et al.

    A spline-based tool to assess and visualize the calibration of multiclass risk predictions

    J. Biomed. Inform.

    (2015)
  • G. Wang et al.

    Two credit scoring models based on dual strategy ensemble trees

    Knowl.-Based Syst.

    (2012)
  • B.W. Yap et al.

    Using data mining to improve assessment of credit worthiness via credit scoring models

    Expert Syst. Appl.

    (2011)
  • X. Zhu et al.

    Balancing accuracy, complexity and interpretability in consumer credit decision making: a C-TOPSIS classification approach

    Knowl.-Based Syst.

    (2013)
  • H.A. Abdou et al.

    Credit scoring, statistical techniques and evaluation criteria: a review of the literature

    Intell. Syst. Account. Finance Manag.

    (2011)
  • M. Ayer et al.

    An empirical distribution function for sampling with incomplete information

    Ann. Math. Stat.

    (1955)
  • B. Baesens et al.

    Benchmarking state-of-the-art classification algorithms for credit scoring

    J. Oper. Res. Soc.

    (2003)
  • A.C. Bahnsen et al.

    Example-dependent cost-sensitive logistic regression for credit scoring

  • A. Blöchlinger et al.

    A new goodness-of-fit test for event forecasting and its application to credit defaults

    Manag. Sci.

    (2011)
  • R. Caruana et al.

    Getting the most out of ensemble selection

  • S.A. Cole et al.

    Incentivizing calculated risk-taking: Evidence from an experiment with commercial bank loan officers

    J. Finance

    (2012)
  • T.G. Dietterich

    Approximate statistical tests for comparing supervised classification learning algorithms

    Neural Comput.

    (1998)
  • M.A. Doori et al.

    Credit scoring model based on back propagation neural network using various activation and error function

    Int. J. Comput. Sci. Netw. Secur.

    (2014)