A new approach in reject inference of using ensemble learning based on global semi-supervised framework

doi:10.1016/j.future.2020.03.047

Future Generation Computer Systems

Volume 109, August 2020, Pages 382-391

Highlights

• A novel global semi-supervised framework for reject inference is proposed.
• A novel algorithm that combines multiple classifiers and clustering algorithms is introduced.
• The framework is shown to outperform several conventional techniques.
• The framework is validated on a real data set.

Abstract

Credit scoring in online Peer-to-Peer (P2P) lending faces a major challenge: credit scoring models discard rejected applicants. This selective discarding biases the model parameters and ultimately degrades the performance of credit evaluation. One approach to this problem is reject inference, a technique that infers the status of rejected samples and incorporates the results into credit scoring models. The most popular practice of reject inference is to use a credit scoring model built only on accepted samples to directly predict the status of rejected samples. However, in online P2P lending the distribution of accepted samples differs from that of rejected samples. We propose SSL-EC3, a global semi-supervised framework that merges multiple classifiers and clustering algorithms to make better use of the information in rejected samples. It uses multiple unsupervised models (clustering algorithms) to explore the internal relationships among all samples, and then incorporates this information into the ensemble of supervised models (classifiers) to help correct the initial classification results for rejected samples. In addition, we use dynamic ensemble selection (DES) to select an appropriate ensemble of classifiers for each sample to be classified. Experimental results on real data sets demonstrate the benefits of the proposed methods over conventional reject inference methods.

Introduction

Credit scoring is an effective tool for assessing the potential default risks of borrowers, which protects the interests of platforms and investors [1]. Based on borrowers’ history records, including personal information and payment records, credit scoring roughly divides borrowers into two classes, good or bad. Traditional credit scoring models use only accepted applicants and ignore rejected applicants, since the latter have no class labels. This leads to a sample selection bias problem and can degrade the performance of the models. It is therefore unreasonable to use these models to predict the status of all unknown borrowers.

Online Peer-to-Peer (P2P) lending provides a convenient service that allows users to trade directly. One drawback of this convenience is that investors cannot accurately assess the credit of borrowers, which puts the interests of platforms and investors at considerable risk. To protect these interests, platforms and investors usually set high thresholds for borrowers, which leads to a large number of applicants being rejected. In such cases, traditional credit scoring models produce biased predictions of borrowers’ default risks. How to include rejected applicants in credit scoring models has become a major challenge, especially in online P2P lending.

Reject inference refers to the use of an approach to infer the status (good or bad) of rejected applicants and to add the results to the process of building credit scoring models [2]. Sohn et al. [3], [4] argue that reject inference is essentially a missing-data problem. They divide missing-data mechanisms into three categories: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). Different missing-data mechanisms call for different approaches. Furthermore, reject inference avoids wasting resources and improves the stability of credit scoring models [5], [6], [7].
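To make the three mechanisms concrete, the following is a minimal simulation sketch, not taken from the paper. The default-probability formula, the attribute names `score` and `hidden`, and the acceptance rules are all hypothetical choices made for illustration. It compares the bad rate of the whole applicant population with the bad rate observed among accepted applicants only.

```python
import random


def simulate(mechanism, n=20000, seed=0):
    """Return (population bad rate, bad rate among accepted applicants)."""
    rng = random.Random(seed)
    bad_all = bad_acc = n_acc = 0
    for _ in range(n):
        score = rng.random()   # observed attribute (hypothetical), in [0, 1)
        hidden = rng.random()  # unobserved attribute (hypothetical), in [0, 1)
        # Assumed default probability: falls with score, rises with hidden.
        p_bad = 0.5 * (1 - score) + 0.4 * hidden
        bad = rng.random() < p_bad
        if mechanism == "MCAR":
            accepted = rng.random() < 0.5  # coin toss, unrelated to any attribute
        elif mechanism == "MAR":
            accepted = score > 0.5         # depends only on observed data
        elif mechanism == "MNAR":
            accepted = hidden < 0.5        # depends on unobserved data
        else:
            raise ValueError(f"unknown mechanism: {mechanism}")
        bad_all += bad
        if accepted:
            n_acc += 1
            bad_acc += bad
    return bad_all / n, bad_acc / n_acc


for name in ("MCAR", "MAR", "MNAR"):
    population, accepted_only = simulate(name)
    print(f"{name}: population={population:.3f} accepted={accepted_only:.3f}")
```

Under these assumptions, the accepted-only bad rate tracks the population rate for MCAR and is lower for MAR and MNAR. In the MAR case the gap can in principle be explained by the observed `score`, whereas in the MNAR case it cannot be recovered from observed attributes alone.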

In recent years, many reject inference methods, such as extrapolation and augmentation [8], have emerged and been successfully applied to credit scoring. Machine learning models have also attracted researchers’ attention.
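As an illustration of augmentation, the sketch below follows the commonly described re-weighting form of the technique: accepted applicants are grouped into bands by an accept/reject score, and each accepted applicant is weighted by (accepted + rejected) / accepted within its band. The paper itself does not spell out this procedure, so the function name, the banding rule, and the assumption that scores lie in [0, 1] are illustrative only.

```python
from collections import Counter


def augmentation_weights(accepted_scores, rejected_scores, n_bands=5):
    """Map each score band to the weight given to accepted applicants in it.

    Scores are assumed to lie in [0, 1]. Bands that contain no accepted
    applicants are skipped, because there is nobody to re-weight.
    """
    def band(score):
        return min(int(score * n_bands), n_bands - 1)

    accepted = Counter(band(s) for s in accepted_scores)
    rejected = Counter(band(s) for s in rejected_scores)
    return {b: (accepted[b] + rejected[b]) / accepted[b] for b in accepted}


weights = augmentation_weights(
    accepted_scores=[0.9, 0.95, 0.7, 0.3],
    rejected_scores=[0.1, 0.35, 0.25, 0.75],
    n_bands=2,
)
print(weights)
```

In this toy input the single accepted applicant in the low band stands in for four applicants, so it receives a weight of 4, while accepted applicants in the high band receive a weight of 4/3.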

Machine learning models based on reject inference are roughly divided into two types: supervised models and semi-supervised models. Supervised models have strong predictive power, but cannot exploit the potential information in unlabeled samples [4], [9]. Recently, research has focused on semi-supervised models [10], [11], [12], which model labeled and unlabeled samples simultaneously and therefore seem naturally suited to reject inference [13]. However, semi-supervised learning is subject to many restrictions in practical applications. For example, the labels of labeled samples should be correct, and the distribution of unlabeled samples should be the same as or similar to that of labeled samples. Real situations are often not ideal, especially in online P2P lending. As shown in Table 2, there is an obvious difference between accepted (labeled) and rejected (unlabeled) samples in Lending Club.

We focus on how to fully exploit the predictive power of classifiers and the internal relationships between rejected and accepted samples. Many works have shown that combining multiple classifiers and clustering algorithms yields more stable classification results and makes use of rejected samples [14], [15], [16], [17]. We introduce an ensemble learning method that maximizes the consensus among the outputs of multiple classifiers and clustering algorithms.
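The idea of letting cluster structure adjust classifier outputs can be illustrated with a deliberately simplified stand-in. The sketch below is not the EC3 algorithm: it only averages classifier votes into an initial score and then pulls each sample toward the mean score of the clusters it belongs to. The function name and the mixing weight `alpha` are assumptions introduced for this example.

```python
def consensus_scores(clf_votes, cluster_labels, alpha=0.5):
    """Blend classifier votes with cluster-level averages.

    clf_votes: one list of 0/1 predictions per classifier.
    cluster_labels: one list of cluster ids per clustering algorithm.
    Returns one score per sample; higher means more likely class 1 (bad).
    """
    n = len(clf_votes[0])
    # Initial score: fraction of classifiers voting 1 for each sample.
    initial = [sum(votes[i] for votes in clf_votes) / len(clf_votes)
               for i in range(n)]
    smoothed = [0.0] * n
    for labels in cluster_labels:
        totals, counts = {}, {}
        for i, c in enumerate(labels):
            totals[c] = totals.get(c, 0.0) + initial[i]
            counts[c] = counts.get(c, 0) + 1
        # Each sample receives the mean initial score of its cluster.
        for i, c in enumerate(labels):
            smoothed[i] += (totals[c] / counts[c]) / len(cluster_labels)
    return [alpha * initial[i] + (1 - alpha) * smoothed[i] for i in range(n)]


scores = consensus_scores(
    clf_votes=[[0, 0, 1, 1], [0, 1, 1, 1], [0, 0, 0, 1]],
    cluster_labels=[[0, 0, 1, 1], [0, 0, 0, 1]],
)
print([round(s, 3) for s in scores])
```

In this toy input the second sample starts at 1/3 from the classifier votes, but it shares its clusters mostly with low-scoring samples, so its blended score moves down. This is the kind of correction that cluster information is meant to supply for rejected samples.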

In this work, a particular version of the combining clustering and classification for ensemble learning (EC3) framework [18] is integrated into a global semi-supervised learning (SSL) framework to perform reject inference, namely SSL-EC3. We use classifiers combined with clustering algorithms to obtain a better credit scoring model. SSL-EC3 is built on two fundamental hypotheses: (i) the ensemble of classifiers has powerful classification capabilities, which ensures the accuracy of credit scoring models; (ii) the integration of clustering methods can explore the inherent relationships between accepted and rejected samples, which ensures the generalization ability of credit scoring models. Furthermore, we adopt dynamic ensemble selection (DES) to automatically select the appropriate classifiers for each sample [19]. DES is available as a Python library that implements advanced dynamic classifier and ensemble selection techniques. Our experimental results show that SSL-EC3 is helpful for reject inference in online P2P lending.
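The general idea behind dynamic ensemble selection can be sketched in plain Python without relying on any library. The text does not say which DES technique is used, so the rule below is an assumption chosen for illustration: for each query sample, keep the classifiers that are most accurate on its k nearest labeled validation samples. The classifiers are hypothetical callables that map a feature tuple to 0 or 1.

```python
def select_ensemble(query, val_X, val_y, classifiers, k=3):
    """Return indices of the classifiers most accurate near the query."""
    def sq_dist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))

    # Region of competence: the k validation samples closest to the query.
    nearest = sorted(range(len(val_X)),
                     key=lambda i: sq_dist(query, val_X[i]))[:k]
    hits = [sum(clf(val_X[i]) == val_y[i] for i in nearest)
            for clf in classifiers]
    best = max(hits)
    return [j for j, h in enumerate(hits) if h == best]


def predict(query, val_X, val_y, classifiers, k=3):
    """Majority vote of the selected classifiers; a tie is resolved to 0."""
    chosen = select_ensemble(query, val_X, val_y, classifiers, k)
    votes = [classifiers[j](query) for j in chosen]
    return int(2 * sum(votes) > len(votes))


# Toy validation set and three hypothetical one-feature classifiers.
val_X = [(0.1,), (0.2,), (0.3,), (0.8,), (0.9,)]
val_y = [0, 0, 0, 1, 1]
classifiers = [
    lambda x: int(x[0] > 0.5),
    lambda x: 1,
    lambda x: int(x[0] > 0.25),
]
print(select_ensemble((0.15,), val_X, val_y, classifiers))
print(predict((0.85,), val_X, val_y, classifiers))
```

Selecting per sample, rather than fixing one ensemble for the whole data set, lets locally unreliable classifiers be excluded only where they fail.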

The structure of this paper is as follows. Section 2 gives an overview of various methods based on reject inference in credit scoring models. Based on the existing methods, we describe the proposed SSL-EC3 framework in Section 3. Section 4 describes the data needed for our experiment and necessary preparations. Section 5 describes and discusses the experimental results in detail. Finally, we conclude the paper.

Section snippets

Related work

In this section, we introduce three different types of missing-data mechanisms in online P2P lending and the corresponding solutions.

MCAR indicates that whether an applicant is accepted or rejected does not depend on his/her history records or personal information, but is purely random. This means that platforms or investors adopt a method similar to tossing a coin to decide whether to accept applicants [4]. Obviously, platforms or investors will not expose themselves to such risks. Therefore, this situation is

Methodology

Table 1 shows some notations used in this paper. Suppose we have N samples $X = \{x_{1}, \dots, x_{N}\}$. For accepted samples, we know that they belong to 2 different classes $C = \{0, 1\}$, where 0 means a sample is labeled as good and 1 means bad. We have b1 base classifiers and b2 base clustering algorithms. To simplify the experiment, each classifier assigns only one class label to a sample, and each clustering method produces only 2 clusters. Therefore, these classifiers and clustering algorithms generate g1 = b1 * 2
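The grouping described in this setup can be pictured as a binary sample-by-group membership table. The sketch below is an illustration under stated assumptions, not the paper's construction: each classifier contributes two columns (predicted class 0 and predicted class 1), and each clustering algorithm likewise contributes two columns, one per cluster, since each is said to produce exactly 2 clusters. The function name and the column order are hypothetical.

```python
def group_membership(clf_preds, cluster_labels):
    """Build a binary membership table with one row per sample.

    clf_preds: one list of 0/1 class predictions per base classifier.
    cluster_labels: one list of 0/1 cluster ids per base clustering.
    Columns: two per classifier first, then two per clustering.
    """
    n = len(clf_preds[0])
    rows = []
    for i in range(n):
        row = []
        for labels in list(clf_preds) + list(cluster_labels):
            # One-hot encode the group this sample falls into.
            row.extend([int(labels[i] == 0), int(labels[i] == 1)])
        rows.append(row)
    return rows


table = group_membership(clf_preds=[[0, 1, 1]], cluster_labels=[[1, 1, 0]])
for row in table:
    print(row)
```

With one classifier and one clustering, every row has four columns and contains exactly two ones, one for the class group and one for the cluster group.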

Experimental setup

In this section, we introduce the data sets, the experimental steps, and the performance indicators used to measure the credit scoring models in our experiment.
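The results discussion refers to accuracy, precision, and recall. As a reminder of how these three indicators are computed for a binary problem, here is a minimal sketch. Treating class 1 (bad) as the positive class is an assumption made for this example, as is returning 0.0 when a denominator is zero.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return (accuracy, precision, recall) for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall


acc, prec, rec = classification_metrics([1, 0, 1, 1, 0], [1, 0, 0, 1, 1])
print(f"accuracy={acc:.3f} precision={prec:.3f} recall={rec:.3f}")
```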

Results discussion

First, we check the performance of the EC3 algorithm on an artificial data set at three levels, as shown in Fig. 3. From the perspective of attribute dimensionality, we continuously reduce the number of attributes and observe the changes in accuracy, precision, and recall of each model. We can see that as the number of attributes decreases, model performance gradually deteriorates. Sixteen attributes is an important threshold. When the number of attributes is less than 16, the

Conclusion

In online P2P lending, many borrowers’ applications are rejected. When building a credit scoring model, we need to incorporate these data to fully assess the potential risks of loans. This paper proposes a framework that combines multiple classifiers with clustering approaches. The ensemble of classifiers can improve the accuracy of credit scoring, and the integration of clustering methods can improve its generalization ability.

Experimental results on real data set show that SSL-EC3

CRediT authorship contribution statement

Yan Liu: Data curation, Formal analysis. Xiner Li: Software, Conceptualization, Methodology. Zaimei Zhang: Supervision, Validation.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

The authors would like to acknowledge the editor and anonymous reviewers for their comments, which have helped to improve the paper. This work was supported by the National Natural Science Foundation of China (Grant 61702053, 61872135), the Natural Science Foundation of Hunan Province (Grant 2018JJ2066), and the open fund project for innovation platform of universities in Hunan (Grant 11K002).

Yan Liu received the Ph.D. degree in computer science and technology from Hunan University, China, in 2010. He is an Associate Professor at the College of Computer Science and Electronic Engineering of Hunan University, China. His research areas include big data, artificial intelligence, and parallel and distributed systems.

References (37)

Saberi, Morteza et al. A granular computing-based approach to credit scoring modeling. Neurocomputing (2013)
Crook, Jonathan et al. Does reject inference really improve the performance of application scoring models? J. Bank. Financ. (2004)
Sohn, So Young et al. Reject inference in credit operations based on survival analysis. Expert Syst. Appl. (2006)
Banasik, John et al. Reject inference, augmentation, and sample selection. European J. Oper. Res. (2007)
Tian, Ye et al. A new approach for reject inference in credit scoring using kernel-free fuzzy quadratic surface support vector machines. Appl. Soft Comput. (2018)
Li, Zhiyong et al. Reject inference in credit scoring using semi-supervised support vector machines. Expert Syst. Appl. (2017)
Xia, Yufei et al. A rejection inference technique based on contrastive pessimistic likelihood estimation for p2p lending. Electron. Commer. Res. Appl. (2018)
Tsai, Chih-Fong et al. Credit rating by hybrid machine learning techniques. Appl. Soft Comput. (2010)
Hsieh, Nan-Chen et al. A data driven ensemble classifier for credit scoring analysis. Expert Syst. Appl. (2010)
Bücker, Michael et al. Reject inference in consumer credit scoring with nonignorable missing data. J. Bank. Financ. (2013)
Lee, Eunkyoung et al. Herding behavior in online p2p lending: An empirical investigation. Electron. Commer. Res. Appl. (2012)
Nanni, Loris et al. An experimental comparison of ensemble of classifiers for bankruptcy prediction and credit scoring. Expert Syst. Appl. (2009)
Crook, Jonathan N. et al. Recent developments in consumer credit risk assessment. European J. Oper. Res. (2007)
Feelders, A.J. Credit scoring and reject inference with mixture models. Int. J. Intell. Syst. Account. Finance Manag. (2000)
Smith, Andrew et al. A Bayesian network framework for reject inference
Kim, Y. et al. Technology scoring model considering rejected applicants and effect of reject inference. J. Oper. Res. Soc. (2007)
Chen, G. Gary et al. The economic value of reject inference in credit scoring
Maldonado, Sebastián et al. A semi-supervised approach for reject inference in credit scoring using SVMs
Xiner Li is a master’s student at Hunan University. Her research interests include data mining and big data.

Zaimei Zhang received the Ph.D. degree in management science and engineering from Hunan University, China, in 2011. She is an Assistant Professor at the School of Economics and Management of Changsha University of Science and Technology, China. Her research interests include financial engineering, big data, and artificial intelligence.
