A novel multi-stage hybrid model with enhanced multi-population niche genetic algorithm: An application in credit scoring
Introduction
In recent years, artificial intelligence and machine learning technology have advanced considerably. In previous studies, several typical classification models have been applied to binary classification, such as linear discriminant analysis (LDA; Fisher, 1936), logistic regression (LR; Hand & Kelly, 2002), decision tree (DT; Li, Ying, Tuo, & Li, 2004), support vector machine (SVM; Huang, Chen, Hsu, Chen, & Wu, 2004), and multilayer perceptron network (MLP; West, 2000).
In general, datasets for machine learning are multidimensional. However, irrelevant and redundant features not only reduce the prediction performance of a classification model but also increase its computational complexity. Feature selection methods are recognized as promising approaches in machine learning; they are applied to identify the key features in order to reduce the computing time of classification models and improve their prediction performance. Previous studies that have explored feature selection methods include Chen and Li (2010), Hajek and Michalak (2013), Maldonado, Pérez, and Bravo (2017), Oreski and Oreski (2014), and Wang, Zhang, Bai, and Mao (2017). However, new capabilities remain to be discovered and explored.
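As a rough illustration of what a filter-style feature selection method does, the sketch below scores each feature by the absolute Pearson correlation with the class label and keeps the top `k`. It is a minimal example under stated assumptions, not the method used in this study: the scoring criterion, the helper names `pearson` and `filter_select`, and the toy data are all hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no information about the label
    return cov / math.sqrt(var_x * var_y)

def filter_select(rows, labels, k):
    """Return (sorted indices of the k best features, per-feature scores).

    Assumption: a feature's relevance is |Pearson correlation| with the label.
    """
    n_features = len(rows[0])
    scores = []
    for j in range(n_features):
        column = [row[j] for row in rows]
        scores.append(abs(pearson(column, labels)))
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k]), scores

# Hypothetical toy data: feature 0 tracks the label, feature 1 is constant,
# feature 2 is only weakly related to the label.
rows = [
    [1.0, 5.0, 0.3],
    [2.0, 5.0, 0.1],
    [3.0, 5.0, 0.4],
    [4.0, 5.0, 0.2],
]
labels = [0, 0, 1, 1]

selected, scores = filter_select(rows, labels, k=1)
print(selected)
```

A filter of this kind is cheap because it never trains a classifier, which is why several filters can be evaluated and combined before any model fitting takes place.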
Ensemble models have also been widely considered as a way to improve the performance of classification models in recent years. Many ensemble models have been applied in machine learning, such as homogeneous ensemble models based on DT, including random forest (RF; Friedman, 2001), gradient boosting decision tree (GBDT; Friedman, 2001), and XGBoost (Chen & Guestrin, 2016). Heterogeneous ensemble models, which combine multiple different base classifiers, have also garnered widespread attention (Ala'Raj & Abbod, 2016a, 2016b; Xia, Liu, Da, & Xie, 2018). Lessmann, Baesens, Seow, and Thomas (2015) showed that the performance of heterogeneous ensembles is frequently superior to that of individual classifiers. However, how to determine the most effective ensemble model for different datasets has not yet been completely solved. In addition, the problem complexity and computational time of classifier selection in the original feature space are usually large. Therefore, effective classifier selection methods should be considered to obtain a more appropriate ensemble model within a given complexity budget.
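To make the idea of combining base classifiers concrete, the following sketch applies simple majority voting to the predictions of three hypothetical classifiers. Majority voting is only one possible combination rule and is assumed here for illustration; the prediction lists are invented toy values chosen so that the classifiers make errors on different samples.

```python
from collections import Counter

def majority_vote(predictions):
    """Combine per-classifier label lists into one label per sample.

    predictions: list of lists, one inner list per base classifier,
    all of the same length (one entry per sample).
    """
    combined = []
    for sample_votes in zip(*predictions):
        counts = Counter(sample_votes)
        combined.append(counts.most_common(1)[0][0])
    return combined

def accuracy(predicted, actual):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)

# Hypothetical toy predictions; each classifier errs on different samples.
y_true = [1, 0, 1, 1, 0, 0]
clf_a = [1, 0, 1, 0, 0, 1]
clf_b = [1, 1, 1, 1, 0, 0]
clf_c = [0, 0, 1, 1, 1, 0]

ensemble = majority_vote([clf_a, clf_b, clf_c])
for name, preds in [("A", clf_a), ("B", clf_b), ("C", clf_c), ("vote", ensemble)]:
    print(name, accuracy(preds, y_true))
```

The ensemble helps here only because the base classifiers disagree on their mistakes. When their errors are strongly correlated, voting offers little gain, which is one reason classifier selection matters.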
Credit scoring has gained considerable attention in the financial industry owing to its importance in credit risk management. Because even a small improvement in a credit scoring model can bring large profits to financial institutions, many artificial intelligence and machine learning models have been applied to credit scoring to verify their performance in binary classification. In this study, we propose a novel multi-stage hybrid model, which combines feature selection and classifier selection, to obtain superior prediction performance. Furthermore, an enhanced multi-population niche genetic algorithm (EMPNGA) is proposed that incorporates several filter methods and a priori knowledge into feature selection and classifier selection, respectively, to obtain the optimal feature and classifier subsets. A classifier ensemble built on these optimal subsets is then used to further improve the prediction performance of the model. The proposed model is applied to credit scoring to verify its prediction performance in binary classification. The experimental results demonstrate that each stage of the hybrid model plays a significant role in improving the prediction performance, and that the final prediction performance of the proposed model is superior to that of the comparative models. This confirms that the proposed model is effective and practical, and provides a new research direction for future machine learning research.
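The sketch below shows how a plain genetic algorithm can search for a feature (or classifier) subset encoded as a bit mask. It is deliberately a basic single-population GA with tournament selection, one-point crossover, bit-flip mutation, and elitism; it is not the EMPNGA proposed in this study, and it omits the multi-population and niche mechanisms. The function `run_ga`, its parameters, and the toy fitness function are all assumptions made for illustration.

```python
import random

def run_ga(fitness, n_bits, pop_size=20, generations=30,
           crossover_rate=0.8, mutation_rate=0.05, seed=0):
    """Search for a bit mask (1 = keep the item) that maximizes `fitness`.

    Returns (best mask, list of best fitness values per generation).
    """
    rng = random.Random(seed)  # seeded so that runs are reproducible
    population = [[rng.randint(0, 1) for _ in range(n_bits)]
                  for _ in range(pop_size)]
    best = max(population, key=fitness)
    history = [fitness(best)]

    def tournament(pop):
        # Pick two distinct individuals at random and keep the fitter one.
        a, b = rng.sample(pop, 2)
        return a if fitness(a) >= fitness(b) else b

    for _ in range(generations):
        children = [best[:]]  # elitism: the current best always survives
        while len(children) < pop_size:
            p1, p2 = tournament(population), tournament(population)
            if rng.random() < crossover_rate:
                cut = rng.randint(1, n_bits - 1)  # one-point crossover
                child = p1[:cut] + p2[cut:]
            else:
                child = p1[:]
            # Bit-flip mutation applied independently to every gene.
            child = [1 - g if rng.random() < mutation_rate else g
                     for g in child]
            children.append(child)
        population = children
        best = max(population, key=fitness)
        history.append(fitness(best))
    return best, history

# Hypothetical fitness: reward three "useful" features, penalize the rest.
USEFUL = {0, 2, 5}

def toy_fitness(mask):
    gain = sum(bit for i, bit in enumerate(mask) if i in USEFUL)
    cost = 0.5 * sum(bit for i, bit in enumerate(mask) if i not in USEFUL)
    return gain - cost

best_mask, history = run_ga(toy_fitness, n_bits=8)
print(best_mask, history[-1])
```

In a real setting the fitness function would be a validation metric of a classifier trained on the selected features, which is far more expensive than this toy function. That cost is why seeding or guiding the search with filter scores, as described above, is attractive.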
The remainder of this study is organized as follows. Section 2 describes related work on genetic algorithms, feature selection, and classifier ensembles. Section 3 describes the mechanism of the proposed model. Section 4 presents the experimental design. Section 5 describes the experimental results and comparative analysis. The conclusions and future work are presented in Section 6.
Related work
The work in this paper relates to three topics: (1) genetic algorithms, (2) feature selection, and (3) classifier ensembles. As important sub-fields of machine learning research, these topics have attracted much attention from scholars. In this section, the three topics are reviewed and their applications in credit scoring are discussed.
The proposed multi-stage hybrid model
In this section, the multi-stage hybrid model is presented, and its framework is described in Fig. 1. This hybrid model can be divided into three stages: feature selection, classifier selection, and classifier ensemble. In the feature selection stage, the preprocessed data are used as input data and several filter methods are combined to determine the synthetic feature importance of all the features. The synthetic feature importance combines the respective characteristics of the several filter
Credit datasets
In the experiment, five real-world credit datasets are used to verify the performance of the proposed model: three credit scoring datasets from the UCI Machine Learning Repository (Asuncion & Newman, 2007), namely the Australian, German, and Japanese datasets; the PPDai dataset, which is part of a loan dataset provided by the Chinese internet finance enterprise PaiPaiDai; and the GMSC dataset, which is published by a well-known data competition platform (Kaggle
Experimental results
In this section, experimental results are presented to compare the proposed model with other classifiers and to demonstrate its effectiveness. All of the experiments were run with Python 3.6 on a PC with a 3.2 GHz Intel Core i7 processor and 32 GB of RAM, running the Microsoft Windows 7 operating system.
Conclusions and future work
In recent years, artificial intelligence and machine learning technology have developed rapidly, and various novel models have been constructed to enhance prediction performance in binary classification. Researchers have conducted numerous valuable explorations in several fields, including feature selection, classifier selection, and classifier ensemble. Although some studies have combined the above-mentioned approaches, the optimal integration of them has not been
Acknowledgment
This work was supported by the National Natural Science Foundation of China (Nos. 51875503, 51475410) and the Zhejiang Natural Science Foundation of China (No. LY17E050010).
References (40)
- et al. (2016). Classifiers consensus system approach for credit scoring. Knowledge-Based Systems.
- et al. (2016). A new hybrid ensemble credit scoring model based on classifiers consensus system approach. Expert Systems with Applications.
- et al. (2007). UCI machine learning repository.
- et al. (2017). Approaches for credit scorecard calibration: An empirical analysis. Knowledge-Based Systems.
- (1996). Bagging predictors. Machine Learning.
- et al. (2010). Combination of feature selection approaches with SVM in credit scoring. Expert Systems with Applications.
- et al. (2011). A genetic algorithm-based approach to cost-sensitive bankruptcy prediction. Expert Systems with Applications.
- et al. Xgboost: A scalable tree boosting system.
- et al. (2017). Hybrid genetic algorithm and fuzzy clustering for bankruptcy prediction. Applied Soft Computing.
- et al. (1991). Elements of information theory.
- Statistical comparisons of classifiers over multiple data sets. The Journal of Machine Learning Research.
- Multiple classifier architectures and their application to credit risk assessment. European Journal of Operational Research.
- Studies in crop variation. I. An examination of the yield of dressed grain from broadbalk. The Journal of Agricultural Science.
- The use of multiple measurements in taxonomic problems. Annals of Human Genetics.
- Greedy function approximation: A gradient boosting machine. Annals of Statistics.
- A comparison of alternative tests of significance for the problem of m rankings. The Annals of Mathematical Statistics.
- Feature selection in corporate credit rating prediction. Knowledge-Based Systems.
- Measuring classifier performance: A coherent alternative to the area under the ROC curve. Machine Learning.
- A better beta for the H measure of classification performance. Pattern Recognition Letters.
- Superscorecards. IMA Journal of Management Mathematics.