Multivariable data imputation for the analysis of incomplete credit data

https://doi.org/10.1016/j.eswa.2019.112926

Highlights

  • A novel iterative imputation method for incomplete credit data is proposed.

  • The method combines an iterative mechanism and a Bayesian network classifier.

  • Our method depends less on assumptions about the probability distribution than other methods.

  • The proposed method suits both single-variable and multivariable missing data.

  • The proposed method is more accurate and more widely applicable than the baseline methods.

Abstract

Missing data significantly reduce the accuracy and usability of credit scoring models, especially when several variables have missing values. Most credit scoring models address this problem by deleting the instances with missing values from the dataset or by imputing missing values with the mean, mode, or regression values. However, these methods often result in a significant loss of information or in bias. We propose a novel method called BNII to impute missing values, which can be helpful for intelligent credit scoring systems. The BNII algorithm consists of two stages: the preparatory stage and the imputation stage. In the first stage, a Bayesian network with all of the attributes in the original dataset is constructed from the complete dataset so that both the network structure, which captures the dependencies between variables, and the parameters of each variable's conditional distribution can be learned. In the second stage, the variables with missing values are iteratively imputed using the Bayesian network models from the first stage. The algorithm was found to be monotonically convergent. The most significant advantages of the method are that it exploits the inherent probabilistic dependencies between variables without assuming a specific probability distribution, and that it is suitable for multivariate missing cases. Three datasets were used for experiments: one was a real dataset from a well-known P2P financial company in China, and the other two were benchmark datasets provided by UCI. The experimental results showed that BNII performed significantly better than other well-known imputation techniques. This suggests that the proposed method can be used to improve the performance of credit scoring systems and can be extended to other expert and intelligent systems.

Introduction

For decades, credit scoring has been used by lenders as a credit risk assessment tool and an important means to reduce information asymmetry (Einav, Jenkins & Levin, 2013). To properly assess a borrower's ability and willingness to repay debt on time, financial institutions collect various information about borrowers from their applications and from credit bureaus, including monthly income, outstanding debt, geographical data, borrowing history, and repayment actions (Bequé & Lessmann, 2017). Using expert judgment or statistical models, they then aggregate this information into a prediction of a borrower's repayment behavior or profitability (Abdou & Pointon, 2011). In recent years, small and medium-sized enterprises (SMSEs) have played an increasingly important role in maintaining economic growth, easing employment pressure, and facilitating people’s livelihoods (Zhang, Li & Chen, 2014). The growing number of SMSEs has increased the demand for quality credit services. At the same time, consumer credit markets of all kinds have also grown rapidly, which has further stimulated the demand by loan institutions for credit scoring models (Kano, Uchida, Udell & Watanabe, 2011).

Generally, three categories of credit scoring methods have been used. In the early stage, loan institutions commonly relied on methods based on the subjective experience of credit experts, such as 5C, 5P, and LAPP (Louzada, Ara & Fernandes, 2016). Later, as statistical techniques spread, regression analysis, linear discriminant analysis (LDA; Fisher, 1936), logistic regression (LR; Sohn et al., 2016, Walker and Duncan, 1967), and probit regression (Bliss, 1934) were introduced into credit scoring (Wiginton, 1980). Examples include the Z-score model proposed by Altman (1968) and the RiskCalc model of Moody’s. In recent years, machine learning techniques have also been introduced to credit scoring, including k-nearest neighbor (KNN; Zhou et al., 2014), support vector machine (SVM; Chen and Li, 2010, Hens and Tiwari, 2012), decision tree (DT; Kao, Chiu & Chiu, 2012), and neural network (NN; Chun‐Ling and Huang, 2011, West, 2000), and these approaches are now regarded as the mainstream techniques in this field (Chen, Ribeiro & Chen, 2016). As a result, intelligent expert credit scoring systems are widely used by credit institutions such as banks.

However, missing values are ubiquitous when conducting credit scoring on enterprises, especially for SMSEs (Gordini, 2014, Shen et al., 2009). In many applications, credit data suffer from unavailability, scarcity, and incompleteness (Schafer, 1997). This issue significantly affects the accuracy and usability of credit scoring systems. The causes of missing data are diverse and complicated, and can include an unwillingness to respond to survey questions, data acquisition fraud, and measurement errors. Two strategies have been commonly employed in practice to overcome this challenge. The first is to drop the instances with missing values from the original dataset, as done by Won, Kim and Bae (2012). The second is to preprocess the data by replacing the missing values with mean values, as done by Feng et al. (2019), Lessmann et al. (2015), and Florez-Lopez (2010). Such methods work well when the percentage of missing data is quite small and when ignoring a test instance with missing values can be tolerated. However, given the scarcity of credit data, these methods are not always the best option (Schafer, 1997). They have been shown to cause a loss of information and to introduce biases into the credit scoring process that can prevent the discovery of important credit risk factors and lead to invalid conclusions. Therefore, we focus mainly on data imputation approaches that estimate the missing values in incomplete credit data. We believe that this work will help to improve data quality in the preprocessing stage of data mining and, consequently, the performance of credit scoring models.
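The two common strategies described above, deletion and mean imputation, can be illustrated with a minimal Python sketch. The toy records, the attribute names (`income`, `debt`), and the helper functions are hypothetical and chosen only for illustration; they are not taken from the paper's datasets.

```python
from statistics import mean

# Toy applicant records; None marks a missing value (hypothetical data).
records = [
    {"income": 5200.0, "debt": 1200.0},
    {"income": None,   "debt": 800.0},
    {"income": 4100.0, "debt": None},
    {"income": 6300.0, "debt": 1500.0},
]

def drop_incomplete(rows):
    """Listwise deletion: keep only rows with no missing values."""
    return [r for r in rows if all(v is not None for v in r.values())]

def impute_mean(rows):
    """Replace each missing value with the mean of the observed values in its column."""
    columns = list(rows[0].keys())
    col_means = {
        c: mean(r[c] for r in rows if r[c] is not None) for c in columns
    }
    return [
        {c: (r[c] if r[c] is not None else col_means[c]) for c in columns}
        for r in rows
    ]

kept = drop_incomplete(records)
filled = impute_mean(records)
print(len(records), len(kept))  # how many records survive deletion
print(filled[1]["income"])      # imputed with the mean of the observed incomes
```

In this toy case deletion discards half of the records, while mean imputation keeps every record but assigns the same value to every missing cell in a column, regardless of the other attributes of that applicant.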

We present a novel missing value imputation method that is suitable for multivariable missing credit data. The method was inspired by the EM algorithm of Dempster, Laird and Rubin (1977), which finds local maximum likelihood parameters of a statistical model by alternately updating the parameters and the likelihood function. By combining an iterative mechanism with a Bayesian network classifier to estimate the missing values, our method (1) introduces an iterative strategy based on increasing posterior probability so that the imputed values fit the real distribution more closely, which makes the algorithm more accurate; (2) reduces the dependence on assumptions about the probability distribution, which makes the algorithm more widely applicable; and (3) treats all attributes in the original dataset as nodes of the Bayesian network, which makes the algorithm suitable for both single-variable and multivariable missing data. The proposed method showed a good capability to impute missing values using all of the knowledge in the complete dataset, which suggests that it can benefit credit scoring systems and decision makers. The proposed framework represents a significant step toward robust expert and intelligent credit scoring systems.
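The general idea of iterating toward a higher posterior probability can be sketched as follows. This is not the authors' BNII implementation: the network structure (`PARENTS`) is fixed by hand rather than learned, the parameters are simple smoothed counts, the update is a greedy one-attribute-at-a-time search, and all names and data are hypothetical. Because the observed attributes stay fixed, raising the joint probability of a record also raises the posterior probability of its imputed values.

```python
import math
from collections import Counter

# Hypothetical, hand-fixed structure: attribute -> list of parent attributes.
PARENTS = {"job": [], "income": ["job"], "default": ["income"]}

def learn_cpts(complete_rows, parents):
    """Estimate conditional probabilities by counting, with add-one smoothing."""
    domains = {a: sorted({r[a] for r in complete_rows}) for a in parents}
    counts = {a: Counter() for a in parents}
    for r in complete_rows:
        for a, ps in parents.items():
            counts[a][(tuple(r[p] for p in ps), r[a])] += 1

    def prob(attr, value, row):
        key = tuple(row[p] for p in parents[attr])
        total = sum(counts[attr][(key, v)] for v in domains[attr])
        return (counts[attr][(key, value)] + 1) / (total + len(domains[attr]))

    return domains, prob

def log_joint(row, parents, prob):
    """Log joint probability of a fully specified record under the network."""
    return sum(math.log(prob(a, row[a], row)) for a in parents)

def impute_row(row, parents, domains, prob, max_iter=20):
    """Greedy coordinate ascent over the missing attributes of one record."""
    missing = [a for a in parents if row[a] is None]
    filled = {**row, **{a: domains[a][0] for a in missing}}  # arbitrary start
    history = [log_joint(filled, parents, prob)]
    for _ in range(max_iter):
        changed = False
        for a in missing:
            best = max(domains[a],
                       key=lambda v: log_joint({**filled, a: v}, parents, prob))
            best_score = log_joint({**filled, a: best}, parents, prob)
            if best_score > history[-1]:  # accept strict improvements only
                filled[a] = best
                history.append(best_score)
                changed = True
        if not changed:
            break
    return filled, history

complete = [
    {"job": "salaried", "income": "high", "default": "no"},
    {"job": "salaried", "income": "high", "default": "no"},
    {"job": "salaried", "income": "low",  "default": "yes"},
    {"job": "self",     "income": "low",  "default": "yes"},
    {"job": "self",     "income": "low",  "default": "yes"},
    {"job": "self",     "income": "low",  "default": "no"},
]
domains, prob = learn_cpts(complete, PARENTS)
filled, history = impute_row(
    {"job": "self", "income": None, "default": None}, PARENTS, domains, prob
)
print(filled)
print(history)  # log joint probability after each accepted update
```

Since only strict improvements are accepted, the recorded scores never decrease and the loop stops at a local maximum, which may differ from the global one depending on the starting guess.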

This paper is organized as follows. In Section 2, we present a literature review of related work. Our proposed BNII algorithm for missing data imputation is described in Section 3. The experimental setting and results are given in Section 4. Finally, Section 5 provides our concluding remarks.

Section snippets

Related work

Since missing data are common in all kinds of statistical analysis, a great number of techniques have been proposed to deal with the issue. Existing techniques can generally be divided into two categories: deletion and imputation methods (Garciarena and Santana, 2017, Hong and Wu, 2011, Purwar and Singh, 2015).

The deletion method includes case deletion and variable deletion. Ignoring cases or variables with missing data is generally a convenient choice when the cardinality of missing data is

Proposed approach

The BNII algorithm proposed in our study consisted of two stages. The first stage was the preparatory stage. In this stage, we created two datasets from the original dataset. The first dataset, denoted as the complete dataset (DComplete), contained the records with no missing values. The second dataset, denoted as the incomplete dataset (DMiss), contained the records with one or more missing attribute values. Then, considering all of the attributes in the original dataset as nodes, a Bayesian

Experiments and results

To verify the validity of the BNII algorithm in credit scoring, we used three credit datasets as the experimental data. One was from the Renrendai website, a well-known P2P financial company in China, and the other two (German and Australian) were benchmark UCI datasets. The experiments compared our algorithm with the mode value imputation and EM imputation methods in two respects: the imputation accuracy and the performance of the credit scoring model after imputation.
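As a rough illustration of the first comparison, the sketch below implements the mode value imputation baseline and scores it on artificially hidden values. It assumes that imputation accuracy means the share of hidden values recovered exactly; this is a common convention, but the text above does not state how the paper measures it. The column values and function names are hypothetical.

```python
from collections import Counter

def mode_impute(column):
    """Fill None entries with the most frequent observed value."""
    observed = [v for v in column if v is not None]
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if v is None else v for v in column]

def imputation_accuracy(truth, masked, imputed):
    """Share of hidden cells whose imputed value equals the held-out true value."""
    hidden = [i for i, v in enumerate(masked) if v is None]
    hits = sum(imputed[i] == truth[i] for i in hidden)
    return hits / len(hidden)

# Hypothetical categorical attribute with two values hidden on purpose.
truth = ["own", "rent", "own", "own", "rent", "own"]
masked = ["own", None, "own", None, "rent", "own"]

imputed = mode_impute(masked)
print(imputed)
print(imputation_accuracy(truth, masked, imputed))
```

The same scoring function could be applied to the output of any other imputation method on the same masked column, which makes the methods directly comparable.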

Conclusions

We proposed a new imputation method called the BNII algorithm for multivariate missing credit data. The proposed method viewed the imputation of missing values as an optimization problem and solved it by combining an iterative mechanism and data mining techniques. The BNII algorithm consisted of two stages: fully indicating the relationship among different attributes based on the Bayesian network, and iteratively imputing missing values to find better estimates until it reached the local

CRediT authorship contribution statement

Qiujun Lan: Conceptualization, Methodology, Resources, Writing - review & editing, Funding acquisition. Xuqing Xu: Validation, Investigation, Writing - original draft. Haojie Ma: Software. Gang Li: Writing - review & editing.

Declaration of Competing Interest

None.

Acknowledgments

This research was supported by the National Natural Science Foundation of China (Nos. 71871090, 71301047), the Science Foundation of Ministry of Education of China (18YJAZH038), the Hunan Provincial Science & Technology Major Project (2018GK1020), the Xinjiang Uygur Autonomous Region research fund, and the Deakin University ASL 2019 fund. We thank LetPub (www.letpub.com) for its linguistic assistance during the preparation of this manuscript.

References (45)

  • F. Louzada et al. Classification methods applied to credit scoring: systematic review and overall comparison. Surveys in Operations Research & Management Science (2016)

  • J. Luengo et al. A study on the use of imputation methods for experimentation with radial basis function network classifiers handling missing attribute values: The good synergy between RBFNs and event covering method. Neural Networks (2010)

  • A. Purwar et al. Hybrid prediction model with missing value imputation for medical data. Expert Systems with Applications (2015)

  • H. Shahbazi et al. A novel regression imputation framework for Tehran air pollution monitoring network using outputs from WRF and CAMX models. Atmospheric Environment (2018)

  • Y. Shen et al. Bank size and small- and medium-sized enterprise (SME) lending: Evidence from China. World Development (2009)

  • S.Y. Sohn et al. Technology credit scoring model with fuzzy logistic regression. Applied Soft Computing (2016)

  • G. Tutz et al. Improved methods for the imputation of missing data by nearest neighbor methods. Computational Statistics and Data Analysis (2015)

  • D. West. Neural network credit scoring models. Computers and Operations Research (2000)

  • C. Won et al. Using genetic algorithm based knowledge refinement model for dividend policy forecasting. Expert Systems with Applications (2012)

  • H.A. Abdou et al. Credit scoring, statistical techniques and evaluation criteria: A review of the literature. Intelligent Systems in Accounting Finance & Management (2011)

  • E.I. Altman. Financial ratios, discriminant analysis and the prediction of corporate bankruptcy. Journal of Finance (1968)

  • F.D. Atem et al. Improved conditional imputation for linear regression with a randomly censored predictor. Statistical Methods in Medical Research (2017)
1 This research work was completed while Gang Li was on ASL at the Chinese Academy of Sciences, and we thank Deakin University for the support of the ASL 2019 fund.
