A new approach in reject inference of using ensemble learning based on global semi-supervised framework

https://doi.org/10.1016/j.future.2020.03.047

Highlights

  • A novel global semi-supervised framework for reject inference is proposed.

  • A novel algorithm that combines multiple classifiers and clustering algorithms is introduced.

  • The framework is shown to outperform several conventional techniques.

  • The framework is validated on a real data set.

Abstract

Credit scoring in online Peer-to-Peer (P2P) lending faces a major challenge: credit scoring models discard rejected applicants. This selective discarding biases the model parameters and ultimately degrades the performance of credit evaluation. One approach to this problem is reject inference, a technique that infers the status of rejected samples and incorporates the results into credit scoring models. The most popular practice of reject inference is to use a credit scoring model built only on accepted samples to directly predict the status of rejected samples. However, in online P2P lending the distribution of accepted samples differs from that of rejected samples. We propose SSL-EC3, a global semi-supervised framework that merges multiple classifiers and clustering algorithms to make better use of the information in rejected samples. It uses multiple unsupervised models (clustering algorithms) to explore the internal relationships among all samples, and then incorporates this information into the ensemble of supervised models (classifiers) to help correct the initial classification results for rejected samples. In addition, we explore dynamic ensemble selection (DES) to select an appropriate ensemble of classifiers for each sample to be classified. Experimental results on real data sets demonstrate the benefits of the proposed methods over conventional reject inference methods.

Introduction

Credit scoring is an effective tool for assessing the potential default risks of borrowers, which protects the interests of platforms and investors [1]. Based on borrowers’ history records, including personal information and payment records, credit scoring roughly divides borrowers into two classes, good or bad. Traditional credit scoring models use only accepted applicants and ignore rejected ones, since rejected applicants have no class labels. This leads to a sample selection bias problem and can even degrade the performance of the models. It is therefore unreasonable to use these models to predict the status of all unknown borrowers.

Online Peer-to-Peer (P2P) lending provides a convenient service that allows users to trade directly. One drawback of this convenience is that investors cannot accurately assess the credit of borrowers, which puts the interests of platforms and investors at considerable risk. To protect these interests, platforms and investors usually set high thresholds for borrowers, which leads to a large number of applicants being rejected. In such cases, traditional credit scoring models give biased predictions of borrowers’ default risks. How to incorporate rejected applicants into credit scoring models has become a major challenge, especially in online P2P lending.

Reject inference refers to the use of a method to infer the status (good or bad) of rejected applicants and to incorporate the results into the construction of credit scoring models [2]. Sohn et al. [3], [4] argue that reject inference is essentially a missing-data problem. They divide the missing-data mechanisms into three categories: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). Different missing-data mechanisms call for different treatments. Furthermore, reject inference avoids wasting resources and improves the stability of credit scoring models [5], [6], [7].
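To make the difference between these mechanisms concrete, the following minimal Python sketch simulates how the bad rate among accepted applicants behaves when acceptance is a coin toss (MCAR) versus when it depends on an observed score (MAR). The function name, the assumption that an applicant defaults with probability 1 - score, and the 0.6 acceptance threshold are all illustrative and are not taken from the paper; MNAR is not simulated here.

```python
import random


def simulate_bad_rates(n=20000, threshold=0.6, seed=0):
    """Compare bad rates of accepted samples under two acceptance mechanisms.

    Illustrative assumption: each applicant has an observed score in [0, 1)
    and defaults ("bad") with probability 1 - score.
    """
    rng = random.Random(seed)
    applicants = []
    for _ in range(n):
        score = rng.random()
        is_bad = rng.random() > score  # true with probability 1 - score
        applicants.append((score, is_bad))

    def bad_rate(rows):
        return sum(is_bad for _, is_bad in rows) / len(rows)

    # MCAR: acceptance is a coin toss, independent of the applicant.
    mcar_accepted = [row for row in applicants if rng.random() < 0.5]
    # MAR: acceptance depends only on the observed score.
    mar_accepted = [row for row in applicants if row[0] >= threshold]

    return {
        "population": bad_rate(applicants),
        "mcar_accepted": bad_rate(mcar_accepted),
        "mar_accepted": bad_rate(mar_accepted),
    }


rates = simulate_bad_rates()
print(rates)
```

Under these assumptions, the MCAR-accepted bad rate stays close to the population bad rate, while the score-based acceptance yields a clearly lower bad rate, which is the kind of selection bias a model trained only on accepted samples inherits.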

In recent years, many methods based on reject inference, such as extrapolation and augmentation [8], have emerged and have been successfully applied to credit scoring. Machine learning models have also attracted researchers’ attention.

Machine learning models for reject inference are roughly divided into two types: supervised models and semi-supervised models. Supervised models have strong predictive power but cannot exploit the potential information in unlabeled samples [4], [9]. Recently, research has focused on semi-supervised models [10], [11], [12]. Semi-supervised models model labeled and unlabeled samples simultaneously, so they seem naturally suited to reject inference [13]. However, semi-supervised learning is subject to many restrictions in practical applications. For example, the classes of labeled samples should be correct, and the distribution of unlabeled samples should be the same as or similar to that of labeled samples. Real situations are often not ideal, especially in online P2P lending. As shown in Table 2, there is an obvious difference between accepted (labeled) and rejected (unlabeled) samples in Lending Club.
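One simple way to quantify how far the accepted and rejected distributions differ on a single attribute is the two-sample Kolmogorov-Smirnov statistic, the largest gap between the two empirical cumulative distribution functions. The sketch below is a plain-Python illustration; the attribute values are hypothetical and are not the Lending Club figures from Table 2, and the paper does not state that it uses this statistic.

```python
def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic.

    Returns the maximum absolute difference between the empirical CDFs of
    the two samples: 0.0 for identical samples, 1.0 for disjoint ranges.
    """
    a, b = sorted(sample_a), sorted(sample_b)
    i = j = 0
    max_gap = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance both pointers past every value <= x.
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        max_gap = max(max_gap, abs(i / len(a) - j / len(b)))
    return max_gap


# Hypothetical debt-to-income ratios (illustrative values only).
accepted_dti = [5, 8, 10, 12, 15, 18]
rejected_dti = [14, 20, 25, 30, 35, 40]

gap = ks_statistic(accepted_dti, rejected_dti)
print(f"KS statistic: {gap:.3f}")
```

A value near 0 suggests the two groups look alike on that attribute, while a value near 1 indicates the kind of mismatch that violates the usual semi-supervised assumption.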

We focus on how to fully exploit the predictive power of classifiers and the internal relationships between rejected and accepted samples. Many works have shown that combining multiple classifiers and clustering algorithms yields more stable classification results and brings the rejected samples into play [14], [15], [16], [17]. We introduce an ensemble learning method that maximizes the consensus among the outputs of multiple classifiers and clustering algorithms.
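The idea of letting cluster structure adjust classifier outputs can be illustrated with a deliberately simplified sketch. It is not the EC3 optimization of [18]: here the averaged classifier probability of each sample is simply blended with the mean probability of the clusters it falls into. The function name, the blending weight `alpha`, and the toy inputs are assumptions made for illustration.

```python
from collections import defaultdict


def consensus_scores(classifier_probs, cluster_labels, alpha=0.5):
    """Blend averaged classifier outputs with cluster-level averages.

    classifier_probs: one list per classifier with P(bad) for each sample.
    cluster_labels:   one list per clustering with a cluster id per sample.
    alpha:            weight of the classifier average (hypothetical knob).
    """
    n = len(classifier_probs[0])
    # Step 1: plain ensemble average over the classifiers.
    avg = [sum(p[i] for p in classifier_probs) / len(classifier_probs)
           for i in range(n)]

    # Step 2: for each clustering, replace a sample's score by its cluster mean.
    smoothed = [0.0] * n
    for labels in cluster_labels:
        sums, counts = defaultdict(float), defaultdict(int)
        for i, c in enumerate(labels):
            sums[c] += avg[i]
            counts[c] += 1
        for i, c in enumerate(labels):
            smoothed[i] += sums[c] / counts[c]
    smoothed = [s / len(cluster_labels) for s in smoothed]

    # Step 3: blend the two views of each sample.
    return [alpha * a + (1 - alpha) * s for a, s in zip(avg, smoothed)]


# Toy input: two classifiers, one clustering, four samples.
probs = [[0.1, 0.2, 0.6, 0.9],
         [0.3, 0.2, 0.4, 0.7]]
clusters = [[0, 0, 1, 1]]

scores = consensus_scores(probs, clusters)
print(scores)
```

In this toy input the third sample sits exactly on the decision boundary after averaging the classifiers; sharing a cluster with a clearly bad sample pushes its score above 0.5.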

In this work, a particular version of the combining clustering and classification for ensemble learning (EC3) framework [18] is integrated into a global semi-supervised learning (SSL) framework to perform reject inference; we call the result SSL-EC3. We use classifiers combined with clustering algorithms to obtain a better credit scoring model. SSL-EC3 is built on two fundamental hypotheses: (i) the ensemble of classifiers has powerful classification capabilities, which ensures the accuracy of credit scoring models; (ii) the integration of clustering methods can explore the inherent relationships between accepted and rejected samples, which ensures the generalization ability of credit scoring models. Furthermore, we adopt dynamic ensemble selection (DES) to automatically select the appropriate classifiers for each sample [19]. DES is available in a Python library that implements advanced dynamic classifier and ensemble selection techniques. Our experimental results show that SSL-EC3 is helpful for reject inference in online P2P lending.
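The general principle behind dynamic ensemble selection can be sketched in a few lines of plain Python: for each query sample, find its nearest neighbours in a labeled validation set, measure how accurate each base classifier is on those neighbours, and let only the locally best classifiers vote. This is a generic local-accuracy scheme written for illustration; it does not use the API of the library cited in [19], and the threshold classifiers, feature values, and `k` are hypothetical.

```python
import math


def select_ensemble(query, val_X, val_y, classifiers, k=3):
    """Return the classifiers with the best accuracy on the k validation
    samples closest to the query (its region of competence)."""
    nearest = sorted(range(len(val_X)),
                     key=lambda i: math.dist(query, val_X[i]))[:k]
    local_acc = [
        sum(clf(val_X[i]) == val_y[i] for i in nearest) / len(nearest)
        for clf in classifiers
    ]
    best = max(local_acc)
    return [clf for clf, acc in zip(classifiers, local_acc) if acc == best]


def dynamic_predict(query, val_X, val_y, classifiers, k=3):
    """Majority vote of the selected classifiers; ties go to bad (1)."""
    chosen = select_ensemble(query, val_X, val_y, classifiers, k)
    votes_bad = sum(clf(query) for clf in chosen)
    return 1 if 2 * votes_bad >= len(chosen) else 0


# Two hypothetical base classifiers on one normalized feature (1 = bad).
def clf_low(x):
    return 1 if x[0] > 0.3 else 0


def clf_high(x):
    return 1 if x[0] > 0.7 else 0


# Hypothetical labeled validation samples.
val_X = [(0.35,), (0.40,), (0.45,), (0.55,), (0.60,), (0.65,)]
val_y = [1, 1, 1, 0, 0, 0]

print(dynamic_predict((0.42,), val_X, val_y, [clf_low, clf_high]))
print(dynamic_predict((0.62,), val_X, val_y, [clf_low, clf_high]))
```

In this toy setup each classifier is reliable in a different region, so the selection step hands each query to the classifier that was correct on its neighbours.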

The rest of this paper is organized as follows. Section 2 gives an overview of various reject inference methods for credit scoring models. Building on the existing methods, we describe the proposed SSL-EC3 framework in Section 3. Section 4 describes the data used in our experiments and the necessary preparations. Section 5 describes and discusses the experimental results in detail. Finally, we conclude the paper.

Related work

In this section, we introduce three different types of missing-data mechanisms in online P2P lending and the corresponding solutions.

MCAR means that whether an applicant is accepted or rejected does not depend on his/her history records or personal information but is purely random. This means that platforms or investors adopt a method similar to tossing a coin to decide whether to accept applicants [4]. Obviously, platforms or investors will not expose themselves to such risks. Therefore, this situation is

Methodology

Table 1 shows some notations used in this paper. Suppose we have $N$ samples $X = \{x_1, \dots, x_N\}$. For accepted samples, we know that they belong to 2 different classes $C = \{0, 1\}$, where 0 means the sample is labeled as good and 1 means bad. We have $b_1$ base classifiers and $b_2$ base clustering algorithms. To simplify the experiment, each classifier assigns only one class label to a sample, and each clustering method produces only 2 clusters. Therefore, these classifiers and clustering algorithms generate $g_1 = b_1 \times 2$

Experimental setup

In this section, we introduce the data sets, the experimental steps, and the performance indicators used to evaluate the credit scoring models in our experiment.
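As an illustration of such performance indicators, the sketch below computes accuracy, precision, and recall, the three measures named in the results discussion, from true and predicted labels. It treats bad (label 1) as the positive class, following the paper's 0 = good, 1 = bad convention; the function name and the example labels are hypothetical, and the full paper may use further indicators not shown in this excerpt.

```python
def accuracy_precision_recall(y_true, y_pred, positive=1):
    """Compute accuracy, precision, and recall for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted/present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Hypothetical labels: 1 = bad borrower, 0 = good borrower.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

acc, prec, rec = accuracy_precision_recall(y_true, y_pred)
print(acc, prec, rec)
```

In credit scoring, recall on the bad class reflects how many defaulters are caught, while precision reflects how many flagged applicants are truly bad, so the two are usually read together.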

Results discussion

First, we check the performance of the EC3 algorithm on an artificial data set from three perspectives, as shown in Fig. 3. From the perspective of attribute dimensionality, we progressively reduce the number of attributes and observe the changes in the accuracy, precision, and recall of each model. As the number of attributes decreases, model performance gradually deteriorates. An attribute count of 16 is an important point. When the number of attributes is less than 16, the

Conclusion

In online P2P lending, many borrowers’ applications are rejected. When building a credit scoring model, we need to incorporate these data to fully assess the potential risks of loans. This paper proposes a framework that combines multiple classifiers with clustering approaches. The ensemble of classifiers can improve the accuracy of credit scoring, and the integration of clustering methods can improve its generalization ability.

Experimental results on a real data set show that SSL-EC3

CRediT authorship contribution statement

Yan Liu: Data curation, Formal analysis. Xiner Li: Software, Conceptualization, Methodology. Zaimei Zhang: Supervision, Validation.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

The authors would like to acknowledge the editor and anonymous reviewers for their comments, which have helped to improve the paper. This work was supported by the National Natural Science Foundation of China (Grant 61702053, 61872135), the Natural Science Foundation of Hunan Province (Grant 2018JJ2066), and the open fund project for innovation platform of universities in Hunan (Grant 11K002).

Yan Liu received the PhD degree in computer science and technology from Hunan University, China, in 2010. He is an Associate Professor at the College of Computer Science and Electronic Engineering of Hunan University, China. His research areas include big data, artificial intelligence, and parallel and distributed systems.

References (37)

  • Lee, Eunkyoung et al.
    Herding behavior in online P2P lending: An empirical investigation
    Electron. Commer. Res. Appl. (2012)

  • Nanni, Loris et al.
    An experimental comparison of ensemble of classifiers for bankruptcy prediction and credit scoring
    Expert Syst. Appl. (2009)

  • Crook, Jonathan N. et al.
    Recent developments in consumer credit risk assessment
    European J. Oper. Res. (2007)

  • Feelders, A.J.
    Credit scoring and reject inference with mixture models
    Int. J. Intell. Syst. Account. Finance Manag. (2000)

  • Smith, Andrew et al.
    A Bayesian network framework for reject inference

  • Kim, Y. et al.
    Technology scoring model considering rejected applicants and effect of reject inference
    J. Oper. Res. Soc. (2007)

  • Chen, G. Gary et al.
    The economic value of reject inference in credit scoring

  • Maldonado, Sebastián et al.
    A semi-supervised approach for reject inference in credit scoring using SVMs

Xiner Li is a master’s student at Hunan University. Her research interests include data mining and big data.

Zaimei Zhang received the Ph.D. degree in management science and engineering from Hunan University, China, in 2011. She is an Assistant Professor at the School of Economics and Management of Changsha University of Science and Technology, China. Her research interests include financial engineering, big data, and artificial intelligence.
