Multivariable data imputation for the analysis of incomplete credit data

doi:10.1016/j.eswa.2019.112926

Expert Systems with Applications

Volume 141, 1 March 2020, 112926

https://doi.org/10.1016/j.eswa.2019.112926

Highlights

• A novel iterative imputation method for incomplete credit data is proposed.
• The method combines an iterative mechanism and a Bayesian network classifier.
• Our method depends less on assumptions about the probability distribution than other methods.
• The proposed method suits both single-variable and multivariable missing data.
• The proposed method is more accurate and more widely applicable than the baseline methods.

Abstract

Missing data significantly reduce the accuracy and usability of credit scoring models, especially in multivariate missing cases. Most credit scoring models address this problem by deleting the instances with missing values from the dataset or by imputing missing values with the mean, mode, or regression values. However, these methods often result in a significant loss of information or in bias. We proposed a novel method called BNII to impute missing values, which can be helpful for intelligent credit scoring systems. The proposed BNII algorithm consisted of two stages: the preparatory stage and the imputation stage. In the first stage, a Bayesian network with all of the attributes in the original dataset was constructed from the complete dataset so that both the network structure, which implied the dependencies between variables, and the parameters of each variable's conditional distribution could be learned. In the second stage, multiple variables with missing values were iteratively imputed using the Bayesian network models from the first stage. The algorithm was found to be monotonically convergent. The most significant advantages of the method are that it exploits the inherent probabilistic dependencies between variables without assuming a specific probability distribution, and that it is suitable for multivariate missing cases. Three datasets were used for the experiments: one was a real dataset from a famous P2P financial company in China, and the other two were benchmark datasets provided by UCI. The experimental results showed that BNII performed significantly better than other well-known imputation techniques. This suggested that the proposed method can be used to improve the performance of a credit scoring system and can be extended to other expert and intelligent systems.

Introduction

For decades, credit scoring has been used by lenders as a credit risk assessment tool and an important means to reduce information asymmetry (Einav, Jenkins & Levin, 2013). In order to properly assess the borrower's ability and willingness to repay debt on time, financial institutions collect various information about borrowers from their applications and from credit bureaus, including monthly income, outstanding debt, geographical data, borrowing history, and repayment actions (Bequé & Lessmann, 2017). Using expert judgment or statistical analysis models, they then aggregate the information into a prediction of a borrower's repayment behavior or profitability (Abdou & Pointon, 2011). In recent years, small and medium enterprises (SMEs) have played an increasingly important role in maintaining economic growth, easing employment pressure, and facilitating people’s livelihoods (Zhang, Li & Chen, 2014). The growing number of SMEs has increased demand for quality credit services. At the same time, consumer credit markets of all kinds have also grown rapidly, which has further stimulated the demand by loan institutions for credit scoring models (Kano, Uchida, Udell & Watanabe, 2011).

Generally, three categories of credit scoring methods have been used. In the early stage, methods based on the subjective experience of credit experts, such as 5C, 5P, and LAPP, were commonly used by loan institutions (Louzada, Ara & Fernandes, 2016). Later, with the spread of statistical techniques, regression analysis, linear discriminant analysis (LDA; Fisher, 1936), logistic regression (LR; Sohn et al., 2016; Walker and Duncan, 1967), and Probit regression (Bliss, 1934) were introduced into credit scoring (Wiginton, 1980). Examples include the z-score model proposed by Altman (1968) and the risk-calc model of Moody’s. In recent years, machine learning techniques have also been introduced to credit scoring, including k-nearest neighbor (KNN; Zhou et al., 2014), support vector machine (SVM; Chen and Li, 2010; Hens and Tiwari, 2012), decision tree (DT; Kao, Chiu & Chiu, 2012), and neural network (NN; Chun‐Ling and Huang, 2011; West, 2000), and those approaches are regarded as the mainstream techniques in this field (Chen, Ribeiro & Chen, 2016). As a result, intelligent expert credit scoring systems are widely used by credit institutions such as banks.

However, missing values are ubiquitous when conducting credit scoring on enterprises, especially for SMEs (Gordini, 2014; Shen et al., 2009). In many applications, credit data have suffered from unavailability, scarcity, and incompleteness (Schafer, 1997). This issue significantly affects the accuracy and usability of credit scoring systems. The causes of missing data are diverse and complicated, and can include an unwillingness to respond to survey questions, data acquisition fraud, and measurement errors. Two strategies have been commonly employed in practice to overcome this challenge. One is to drop the instances with missing values from the original dataset, as done by Won, Kim and Bae (2012); the other is to replace the missing values with mean values during preprocessing, as done by Feng et al. (2019), Lessmann et al. (2015), and Florez-Lopez (2010). Such methods work well when the percentage of missing data is quite small and when ignoring a test instance with missing values can be tolerated. However, given the scarcity of credit data, these methods are not always the best option (Schafer, 1997). They have been shown to result in the loss of information and to introduce biases into the credit scoring process that can prevent the discovery of important credit risk factors and lead to invalid conclusions. Therefore, we mainly focused on data imputation approaches to estimate the missing values in incomplete credit data scenarios. We believe that this work will be of great benefit to improving data quality in the preprocessing stage of data mining and, consequently, to improving the performance of credit scoring models.
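The two common strategies described above can be illustrated with a minimal Python sketch. The function names (`listwise_delete`, `simple_impute`), the toy records, and the choice of the mode for categorical columns are assumptions made for illustration; they are not taken from the cited studies.

```python
from statistics import mean, mode

def listwise_delete(rows):
    """Strategy 1: drop every record that has at least one missing value."""
    return [r for r in rows if None not in r]

def simple_impute(rows, numeric_cols):
    """Strategy 2: fill gaps with the column mean (numeric) or mode (categorical)."""
    n_cols = len(rows[0])
    fill = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        fill.append(mean(observed) if j in numeric_cols else mode(observed))
    return [[fill[j] if v is None else v for j, v in enumerate(r)] for r in rows]

# Hypothetical toy records: (monthly income, housing status); None marks a gap.
records = [
    [3000, "own"],
    [None, "rent"],
    [5000, None],
    [4000, "rent"],
]

kept = listwise_delete(records)                    # only 2 of 4 records survive
filled = simple_impute(records, numeric_cols={0})  # all 4 records are kept
```

On this toy data, deletion discards half of the records, while mean and mode imputation keep every record but assign the same fill value to each gap in a column, which is one way such methods can distort the data.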

We presented a novel missing value imputation method that was demonstrated to be suitable for multivariable missing credit data. The proposed imputation method was inspired by the EM algorithm presented in Dempster, Laird and Rubin (1977), which finds locally maximum-likelihood parameters of a statistical model by alternately updating the parameters and the likelihood function. Combining an iterative mechanism and a Bayesian network classifier to estimate the missing values, our proposed method did the following: (1) introduced an iterative strategy based on increasing posterior probability so that the imputation results fit the real distribution more closely, which made the algorithm more accurate; (2) decreased the dependence on assumptions about the probability distribution, which made the algorithm more widely applicable; and (3) considered all attributes in the original dataset as nodes to construct the Bayesian network, which made the algorithm suitable for both single-variable and multivariable missing data. The proposed method showed a good capability to impute missing values using all of the knowledge in the complete dataset, which suggested that it can be beneficial for credit scoring systems and decision makers. The proposed framework represented a significant step toward the development of robust expert and intelligent credit scoring systems.
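The general idea of iterative, posterior-driven imputation can be sketched as follows. This is not the BNII algorithm: the sketch swaps the Bayesian network for a much simpler naive-Bayes-style score learned from the complete records, and all names (`fit_counts`, `score`, `iterative_impute`), the Laplace smoothing, and the stopping rule are assumptions made for illustration.

```python
from collections import Counter, defaultdict

def fit_counts(complete_rows):
    """Count values and value pairs in the complete records."""
    n = len(complete_rows[0])
    single = [Counter(r[j] for r in complete_rows) for j in range(n)]
    pair = defaultdict(Counter)  # (a, b, value_of_b) -> counts of attribute a
    for r in complete_rows:
        for a in range(n):
            for b in range(n):
                if a != b:
                    pair[(a, b, r[b])][r[a]] += 1
    return single, pair

def score(j, value, row, single, pair, alpha=1.0):
    """Unnormalised naive-Bayes-style posterior of `value` for attribute j."""
    total = sum(single[j].values())
    s = (single[j][value] + alpha) / (total + alpha * len(single[j]))
    for k, other in enumerate(row):
        if k == j:
            continue
        # Smoothed estimate of P(attribute k = other | attribute j = value).
        joint = pair[(k, j, value)][other]
        s *= (joint + alpha) / (single[j][value] + alpha * len(single[k]))
    return s

def iterative_impute(rows, max_iter=20):
    """Fill None cells, then re-estimate them until no cell changes."""
    complete = [r for r in rows if None not in r]
    single, pair = fit_counts(complete)
    missing = [(i, j) for i, r in enumerate(rows)
               for j, v in enumerate(r) if v is None]
    filled = [list(r) for r in rows]
    for i, j in missing:  # initial guess: the most frequent observed value
        filled[i][j] = single[j].most_common(1)[0][0]
    for _ in range(max_iter):
        changed = False
        for i, j in missing:
            best = max(single[j],
                       key=lambda v: score(j, v, filled[i], single, pair))
            if best != filled[i][j]:
                filled[i][j] = best
                changed = True
        if not changed:  # no cell changed in a full pass: stop
            break
    return filled

# Hypothetical toy records: (income level, repayment record).
data = [
    ("high", "good"), ("high", "good"), ("high", "good"),
    ("low", "bad"), ("low", "bad"),
    ("low", None), ("high", None), (None, None),
]
imputed = iterative_impute(data)
```

Because each missing cell is re-estimated from the current values of the other attributes in the same record, records with several missing attributes are handled by the same loop as records with one. Unlike the convergence property reported for BNII, this simplified stand-in has no convergence guarantee, so the sketch caps the number of passes with `max_iter`.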

This paper is organized as follows. In Section 2, we present a literature review of related work. Our proposed BNII algorithm for missing data imputation is described in Section 3. The experimental settings and results are given in Section 4. Finally, Section 5 provides our concluding remarks.

Section snippets

Related work

Since missing data are common in all kinds of statistical analysis work, a great number of techniques have been proposed to deal with the issue. Existing techniques can generally be divided into two categories: deletion and imputation methods (Garciarena and Santana, 2017; Hong and Wu, 2011; Purwar and Singh, 2015).

The deletion method includes case deletion and variable deletion. Ignoring cases or variables with missing data is generally a convenient choice when the cardinality of missing data is

Proposed approach

The BNII algorithm proposed in our study consisted of two stages. The first stage was the preparatory stage. In this stage, we created two datasets from the original dataset. The first dataset, denoted as the complete dataset (D_Complete), contained the records with no missing values. The second dataset, denoted as the incomplete dataset (D_Miss), contained the records with one or more missing attribute values. Then, considering all of the attributes in the original dataset as nodes, a Bayesian

Experiments and results

In order to verify the validity of the BNII algorithm in credit scoring, we used three credit datasets as the experimental data. One was from the Renrendai website, a famous P2P financial company in China, and the other two (German and Australian) were benchmark UCI datasets. The experiments compared our algorithm with the mode value imputation and EM imputation methods in two respects: the imputation accuracy and the performance of the credit scoring model after imputation.
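One common way to measure imputation accuracy is to hide values whose true content is known, impute them, and count how many are recovered. The sketch below follows that protocol with mode imputation as the baseline. The exact evaluation procedure of the paper is not given in this excerpt, so the masking scheme, the function names (`mask_values`, `mode_impute`, `imputation_accuracy`), and the toy records are assumptions.

```python
import random
from collections import Counter

def mask_values(rows, rate, seed=0):
    """Hide each cell with probability `rate`; return masked rows and the truth."""
    rng = random.Random(seed)
    masked = [list(r) for r in rows]
    hidden = {}
    for i, r in enumerate(rows):
        for j, v in enumerate(r):
            if rng.random() < rate:
                hidden[(i, j)] = v
                masked[i][j] = None
    return masked, hidden

def mode_impute(rows):
    """Baseline: fill each gap with the most frequent observed value of its column."""
    filled = [list(r) for r in rows]
    for j in range(len(rows[0])):
        observed = Counter(r[j] for r in rows if r[j] is not None)
        if not observed:  # column fully masked: nothing to learn from
            continue
        fill = observed.most_common(1)[0][0]
        for r in filled:
            if r[j] is None:
                r[j] = fill
    return filled

def imputation_accuracy(imputed, hidden):
    """Share of hidden cells whose imputed value equals the true value."""
    if not hidden:
        return None
    correct = sum(imputed[i][j] == v for (i, j), v in hidden.items())
    return correct / len(hidden)

# Hypothetical toy records; real experiments would use the credit datasets.
truth = [["high", "good"], ["high", "good"], ["low", "bad"], ["low", "good"]]
masked, hidden = mask_values(truth, rate=0.3, seed=42)
accuracy = imputation_accuracy(mode_impute(masked), hidden)
```

The second comparison mentioned above, the performance of the credit scoring model after imputation, would train the same classifier on each imputed dataset and compare its predictive performance; that step is omitted here.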

Conclusions

We proposed a new imputation method called the BNII algorithm for multivariate missing credit data. The proposed method viewed the imputation of missing values as an optimization problem and solved it by combining an iterative mechanism and data mining techniques. The BNII algorithm consisted of two stages: fully indicating the relationship among different attributes based on the Bayesian network, and iteratively imputing missing values to find better estimates until it reached the local

CRediT authorship contribution statement

Qiujun Lan: Conceptualization, Methodology, Resources, Writing - review & editing, Funding acquisition. Xuqing Xu: Validation, Investigation, Writing - original draft. Haojie Ma: Software. Gang Li: Writing - review & editing.

Declaration of Competing Interest

None.

Acknowledgments

This research was supported by the National Natural Science Foundation of China (Nos. 71871090, 71301047), the Science Foundation of Ministry of Education of China (18YJAZH038), the Hunan Provincial Science & Technology Major Project (2018GK1020), Xinjiang Uygur Autonomous Region research fund and Deakin University ASL 2019 fund. We thank LetPub (www.letpub.com) for its linguistic assistance during the preparation of this manuscript.

References (45)

A. Bequé et al. (2017). Extreme learning machines for credit scoring: an empirical evaluation. Expert Systems with Applications.
F.L. Chen et al. (2010). Combination of feature selection approaches with SVM in credit scoring. Expert Systems with Applications.
R. Deb et al. (2016). Missing value imputation for the analysis of incomplete traffic accident data. Information Sciences.
U. Garciarena et al. (2017). An extensive analysis of the interaction between missing data types, imputation methods, and supervised classifiers. Expert Systems with Applications.
N. Gordini (2014). A genetic algorithm approach for SMEs bankruptcy prediction: empirical evidence from Italy. Expert Systems with Applications.
A.B. Hens et al. (2012). Computational time reduction for credit scoring: an integrated approach based on support vector machine and stratified sampling method. Expert Systems with Applications.
T.P. Hong et al. (2011). Mining rules from an incomplete dataset with a high missing rate. Expert Systems with Applications.
M. Kano et al. (2011). Information verifiability, bank organization, bank competition and bank–borrower relationships. Journal of Banking & Finance.
L.J. Kao et al. (2012). A Bayesian latent variable model with classification and regression tree approach for behavior and credit scoring. Knowledge-Based Systems.
S. Lessmann et al. (2015). Benchmarking state-of-the-art classification algorithms for credit scoring: An update of research. European Journal of Operational Research.
F. Louzada et al. (2016). Classification methods applied to credit scoring: systematic review and overall comparison. Surveys in Operations Research & Management Science.
J. Luengo et al. (2010). A study on the use of imputation methods for experimentation with radial basis function network classifiers handling missing attribute values: The good synergy between RBFNs and event covering method. Neural Network.
A. Purwar et al. (2015). Hybrid prediction model with missing value imputation for medical data. Expert Systems with Applications.
H. Shahbazi et al. (2018). A novel regression imputation framework for Tehran air pollution monitoring network using outputs from WRF and CAMX models. Atmospheric Environment.
Y. Shen et al. (2009). Bank size and small- and medium-sized enterprise (SME) lending: Evidence from China. World Development.
S.Y. Sohn et al. (2016). Technology credit scoring model with fuzzy logistic regression. Applied Soft Computing.
G. Tutz et al. (2015). Improved methods for the imputation of missing data by nearest neighbor methods. Computational Statistics and Data Analysis.
D. West (2000). Neural network credit scoring models. Computers and Operations Research.
C. Won et al. (2012). Using genetic algorithm based knowledge refinement model for dividend policy forecasting. Expert Systems with Applications.
H.A. Abdou et al. (2011). Credit scoring, statistical techniques and evaluation criteria: A review of the literature. Intelligent Systems in Accounting Finance & Management.
E.I. Altman (1968). Financial ratios, discriminant analysis and the prediction of corporate bankruptcy. Journal of Finance.
F.D. Atem et al. (2017). Improved conditional imputation for linear regression with a randomly censored predictor. Statistical Methods in Medical Research.

¹: This research work was completed when Gang Li was on ASL in Chinese Academy of Sciences, and we thank Deakin University for the support of ASL 2019 fund.
