Multivariable data imputation for the analysis of incomplete credit data
Introduction
For decades, lenders have used credit scoring as a credit risk assessment tool and an important means of reducing information asymmetry (Einav, Jenkins & Levin, 2013). To properly assess a borrower's ability and willingness to repay debt on time, financial institutions collect various information about borrowers from their applications and from credit bureaus, including monthly income, outstanding debt, geographical data, borrowing history, and repayment actions (Bequé & Lessmann, 2017). Using expert judgment or statistical models, they then aggregate this information into a prediction of a borrower's repayment behavior or profitability (Abdou & Pointon, 2011). In recent years, small and medium-sized enterprises (SMEs) have played an increasingly important role in maintaining economic growth, easing employment pressure, and supporting people's livelihoods (Zhang, Li & Chen, 2014). The growing number of SMEs has increased demand for quality credit services. At the same time, consumer credit markets of all kinds have also grown rapidly, which has further stimulated lending institutions' demand for credit scoring models (Kano, Uchida, Udell & Watanabe, 2011).
Generally, three categories of credit scoring methods have been used. In the early stage, lending institutions commonly relied on methods based on the subjective experience of credit experts, such as 5C, 5P, and LAPP (Louzada, Ara & Fernandes, 2016). Later, as statistical techniques spread, regression analysis, linear discriminant analysis (LDA; Fisher, 1936), logistic regression (LR; Sohn et al., 2016, Walker and Duncan, 1967), and probit regression (Bliss, 1934) were introduced into credit scoring (Wiginton, 1980). Examples include the z-score model proposed by Altman (1968) and Moody's RiskCalc model. In recent years, machine learning techniques have also been introduced to credit scoring, including k-nearest neighbor (KNN; Zhou et al., 2014), support vector machine (SVM; Chen and Li, 2010, Hens and Tiwari, 2012), decision tree (DT; Kao, Chiu & Chiu, 2012), and neural network (NN; Chun‐Ling and Huang, 2011, West, 2000), and these approaches are now regarded as the mainstream techniques in this field (Chen, Ribeiro & Chen, 2016). As a result, intelligent expert credit scoring systems are widely used by credit institutions such as banks.
However, missing values are ubiquitous when conducting credit scoring on enterprises, especially SMEs (Gordini, 2014, Shen et al., 2009). In many applications, credit data suffer from unavailability, scarcity, and incompleteness (Schafer, 1997). This issue significantly affects the accuracy and usability of credit scoring systems. The causes of missing data are diverse and complicated, and can include an unwillingness to respond to survey questions, data acquisition fraud, and measurement errors. Two strategies have been commonly employed in practice to overcome this challenge. The first is to drop the instances with missing values from the original dataset, as done by Won, Kim and Bae (2012). The second is to replace the missing values with mean values during preprocessing, as done by Feng et al. (2019), Lessmann et al. (2015), and Florez-Lopez (2010). Such methods work well when the percentage of missing data is quite small and when ignoring a test instance with missing values can be tolerated. However, given the scarcity of credit data, these methods are not always the best option (Schafer, 1997). They have been shown to result in the loss of information and to introduce biases into the credit scoring process that can prevent the discovery of important credit risk factors and lead to invalid conclusions. Therefore, we mainly focused on data imputation approaches to estimate the missing values under incomplete credit data scenarios. We believe that this work will be of great benefit in improving data quality during the preprocessing stage of data mining and, consequently, in improving the performance of credit scoring models.
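The two conventional strategies described above can be illustrated with a minimal sketch. The toy records, the column names, and the helper names `drop_incomplete` and `impute_mean` are hypothetical and are not taken from the cited studies; `None` is assumed to mark a missing value.

```python
from statistics import mean

# Hypothetical toy credit records; None marks a missing value.
records = [
    {"income": 5000.0, "debt": 1200.0},
    {"income": None, "debt": 800.0},
    {"income": 7000.0, "debt": None},
    {"income": 6000.0, "debt": 1000.0},
]

def drop_incomplete(rows):
    """Case deletion: keep only the rows that have no missing value."""
    return [r for r in rows if all(v is not None for v in r.values())]

def impute_mean(rows):
    """Mean imputation: fill each gap with the mean of the observed
    values in the same column (assumes every column has at least one
    observed value)."""
    columns = list(rows[0].keys())
    col_means = {
        c: mean(r[c] for r in rows if r[c] is not None) for c in columns
    }
    return [
        {c: (r[c] if r[c] is not None else col_means[c]) for c in columns}
        for r in rows
    ]

deleted = drop_incomplete(records)   # half of the toy records are lost
imputed = impute_mean(records)       # all records kept, gaps filled
```

In this toy case, deletion discards two of the four records, which illustrates why the approach becomes costly when credit data are scarce, while mean imputation keeps every record but ignores any relationship between the attributes.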
We presented a novel missing value imputation method that was demonstrated to be suitable for multivariable missing credit data. The proposed imputation method was inspired by the EM algorithm presented in Dempster, Laird and Rubin (1977), which finds locally maximum likelihood parameters of a statistical model by alternately updating the parameters and the likelihood function in an iterative fashion. Combining an iterative mechanism and a Bayesian network classifier to estimate the missing values, our proposed method did the following: (1) introduced an iterative strategy based on increasing posterior probability so that the imputation results fit the real distribution more closely, which made the algorithm more accurate; (2) reduced the dependence on assumptions about the probability distribution, which made the algorithm more widely applicable; and (3) considered all attributes in the original dataset as nodes when constructing the Bayesian network, which made the algorithm suitable for both single-variable and multivariable missing data. The proposed method showed a good capability to impute missing values by utilizing all of the knowledge in the complete data, which suggested that it can be beneficial for credit scoring systems and decision makers. The proposed framework represented a significant step toward the development of robust expert and intelligent credit scoring systems.
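The general idea of iterating toward more probable imputed values can be sketched as follows. This is not the authors' algorithm: as a loudly stated assumption, a Laplace-smoothed naive Bayes model (a very restricted special case of a Bayesian network) stands in for the Bayesian network classifier, the data are purely categorical, and the function names are hypothetical.

```python
from collections import Counter

def naive_bayes_predict(train, target, query):
    """Return the value v of column `target` that maximizes
    P(v) * prod_k P(query[k] | v), with Laplace smoothing.
    Stand-in for a Bayesian network classifier (assumption)."""
    best, best_score = None, -1.0
    for v in sorted({r[target] for r in train}):
        subset = [r for r in train if r[target] == v]
        score = len(subset) / len(train)          # prior P(v)
        for k, x in enumerate(query):
            if k == target:
                continue
            levels = len({r[k] for r in train})   # distinct values in column k
            match = sum(1 for r in subset if r[k] == x)
            score *= (match + 1) / (len(subset) + levels)
        if score > best_score:
            best, best_score = v, score
    return best

def iterative_impute(rows, max_iter=10):
    """Start from mode imputation, then repeatedly replace each missing
    cell with its most probable value given the other attributes, until
    no cell changes or max_iter passes have run. Assumes each column
    has at least one observed value."""
    missing = [(i, j) for i, r in enumerate(rows)
               for j, v in enumerate(r) if v is None]
    filled = [list(r) for r in rows]              # leave the input untouched
    for i, j in missing:                          # initial guess: column mode
        observed = [r[j] for r in rows if r[j] is not None]
        filled[i][j] = Counter(observed).most_common(1)[0][0]
    for _ in range(max_iter):
        changed = False
        for i, j in missing:
            train = filled[:i] + filled[i + 1:]   # exclude the row itself
            new_value = naive_bayes_predict(train, j, filled[i])
            if new_value != filled[i][j]:
                filled[i][j] = new_value
                changed = True
        if not changed:
            break
    return filled

# Hypothetical toy data: column 0 is an income band, column 1 a rating.
data = [
    ["high", "good"], ["high", "good"], ["high", "good"],
    ["low", "bad"], ["low", "bad"], ["low", None],
]
result = iterative_impute(data)
```

In this toy example, mode imputation alone would fill the gap with the majority rating, whereas the iterative step uses the relationship between the two attributes to revise that initial guess.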
This paper is organized as follows. In Section 2, we present a literature review of related work. Our proposed BNII algorithm for missing data imputation is described in Section 3. The experimental setting and results are given in Section 4. Finally, Section 5 provides our concluding remarks.
Section snippets
Related work
Since missing data are common in all kinds of statistical analysis work, a great number of techniques have been proposed to deal with the issue. Existing techniques can generally be divided into two categories: deletion and imputation methods (Garciarena and Santana, 2017, Hong and Wu, 2011, Purwar and Singh, 2015).
The deletion method includes case deletion and variable deletion. Ignoring cases or variables with missing data is generally a convenient choice when the cardinality of missing data is
Proposed approach
The BNII algorithm proposed in our study consisted of two stages. The first stage was the preparatory stage. In this stage, we created two datasets from the original dataset. The first dataset, denoted as the complete dataset (DComplete), contained the records with no missing values. The second dataset, denoted as the incomplete dataset (DMiss), contained the records with one or more missing attribute values. Then, considering all of the attributes in the original dataset as nodes, a Bayesian
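The split into DComplete and DMiss in the preparatory stage can be sketched as below. This is only an illustration under stated assumptions: records are lists of attribute values, `None` marks a missing value, and the function name `split_by_completeness` is hypothetical.

```python
def split_by_completeness(dataset):
    """Partition records into (d_complete, d_miss).
    d_complete: records with no missing attribute value.
    d_miss: records with at least one missing attribute value."""
    d_complete = [r for r in dataset if all(v is not None for v in r)]
    d_miss = [r for r in dataset if any(v is None for v in r)]
    return d_complete, d_miss

# Hypothetical toy records: [income band, employment status, rating].
original = [
    ["high", "employed", "good"],
    ["low", None, "bad"],
    [None, None, "good"],
    ["low", "unemployed", "bad"],
]
d_complete, d_miss = split_by_completeness(original)
```

Every record lands in exactly one of the two datasets, so the sizes of the two parts always add up to the size of the original dataset.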
Experiments and results
To verify the validity of the BNII algorithm in credit scoring, we used three credit datasets as the experimental data. One was from the Renrendai website, a well-known P2P financial company in China, and the other two (German and Australian) were benchmark UCI datasets. The experiments compared our algorithm with the mode value imputation and EM imputation methods in two aspects: the imputation accuracy and the performance of the credit scoring model after imputation.
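One common way to measure imputation accuracy, shown here only as a hedged sketch, is to hide values that are actually known, impute them, and count how many are recovered. The snippet above does not state the exact protocol used in the experiments, so the masking rate, the accuracy definition, and all function names below are assumptions. Mode value imputation, one of the baselines named above, is used as the method under test.

```python
import random
from collections import Counter

def mask_values(column, rate, seed=0):
    """Hide a fraction `rate` of the entries (at least one) at random.
    Returns the masked column and the sorted indices that were hidden."""
    rng = random.Random(seed)
    n_hide = max(1, round(rate * len(column)))
    hidden = sorted(rng.sample(range(len(column)), n_hide))
    hidden_set = set(hidden)
    masked = [None if i in hidden_set else v for i, v in enumerate(column)]
    return masked, hidden

def mode_impute(column):
    """Fill every missing entry with the most frequent observed value."""
    observed = [v for v in column if v is not None]
    fill = Counter(observed).most_common(1)[0][0]
    return [fill if v is None else v for v in column]

def imputation_accuracy(truth, imputed, hidden):
    """Share of hidden entries whose imputed value equals the true value."""
    correct = sum(1 for i in hidden if truth[i] == imputed[i])
    return correct / len(hidden)

# Hypothetical categorical attribute with a clear majority value.
truth = ["A"] * 8 + ["B"] * 2
masked, hidden = mask_values(truth, rate=0.3)
imputed = mode_impute(masked)
accuracy = imputation_accuracy(truth, imputed, hidden)
```

The same harness can score any imputation function by swapping it in for `mode_impute`, which makes it straightforward to compare several methods on identical masked data.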
Conclusions
We proposed a new imputation method called the BNII algorithm for multivariate missing credit data. The proposed method viewed the imputation of missing values as an optimization problem and solved it by combining an iterative mechanism and data mining techniques. The BNII algorithm consisted of two stages: fully indicating the relationship among different attributes based on the Bayesian network, and iteratively imputing missing values to find better estimates until it reached the local
CRediT authorship contribution statement
Qiujun Lan: Conceptualization, Methodology, Resources, Writing - review & editing, Funding acquisition. Xuqing Xu: Validation, Investigation, Writing - original draft. Haojie Ma: Software. Gang Li: Writing - review & editing.
Declaration of Competing Interest
None.
Acknowledgments
This research was supported by the National Natural Science Foundation of China (Nos. 71871090, 71301047), the Science Foundation of Ministry of Education of China (18YJAZH038), the Hunan Provincial Science & Technology Major Project (2018GK1020), the Xinjiang Uygur Autonomous Region research fund, and the Deakin University ASL 2019 fund. We thank LetPub (www.letpub.com) for its linguistic assistance during the preparation of this manuscript.
References (45)
- Extreme learning machines for credit scoring: an empirical evaluation. Expert Systems with Applications (2017)
- Combination of feature selection approaches with SVM in credit scoring. Expert Systems with Applications (2010)
- Missing value imputation for the analysis of incomplete traffic accident data. Information Sciences (2016)
- An extensive analysis of the interaction between missing data types, imputation methods, and supervised classifiers. Expert Systems with Applications (2017)
- A genetic algorithm approach for SMEs bankruptcy prediction: empirical evidence from Italy. Expert Systems with Applications (2014)
- Computational time reduction for credit scoring: an integrated approach based on support vector machine and stratified sampling method. Expert Systems with Applications (2012)
- Mining rules from an incomplete dataset with a high missing rate. Expert Systems with Applications (2011)
- Information verifiability, bank organization, bank competition and bank–borrower relationships. Journal of Banking & Finance (2011)
- A Bayesian latent variable model with classification and regression tree approach for behavior and credit scoring. Knowledge-Based Systems (2012)
- Benchmarking state-of-the-art classification algorithms for credit scoring: An update of research. European Journal of Operational Research (2015)
- Classification methods applied to credit scoring: systematic review and overall comparison. Surveys in Operations Research & Management Science
- A study on the use of imputation methods for experimentation with radial basis function network classifiers handling missing attribute values: The good synergy between RBFNs and event covering method. Neural Network
- Hybrid prediction model with missing value imputation for medical data. Expert Systems with Applications
- A novel regression imputation framework for Tehran air pollution monitoring network using outputs from WRF and CAMX models. Atmospheric Environment
- Bank size and small- and medium-sized enterprise (SME) lending: Evidence from China. World Development
- Technology credit scoring model with fuzzy logistic regression. Applied Soft Computing
- Improved methods for the imputation of missing data by nearest neighbor methods. Computational Statistics and Data Analysis
- Neural network credit scoring models. Computers and Operations Research
- Using genetic algorithm based knowledge refinement model for dividend policy forecasting. Expert Systems with Applications
- Credit scoring, statistical techniques and evaluation criteria: A review of the literature. Intelligent Systems in Accounting Finance & Management
- Financial ratios, discriminant analysis and the prediction of corporate bankruptcy. Journal of Finance
- Improved conditional imputation for linear regression with a randomly censored predictor. Statistical Methods in Medical Research
1 This research work was completed while Gang Li was on ASL at the Chinese Academy of Sciences, and we thank Deakin University for the support of the ASL 2019 fund.