Abstract
In recent years, peer-to-peer (P2P) lending, a new form of unsecured financing conducted over the Internet, has boomed in China, and credit risk problems have inevitably followed. A key challenge for P2P lending platforms is to predict accurately the default probability of the borrower of each loan; a reliable default prediction model helps the platform avoid credit risk. Traditional default prediction models based on machine learning and statistical learning do not meet this need because the credit data behind data-driven P2P lending contain a large number of missing values, are high-dimensional and are class-imbalanced, which makes the prediction model difficult to train effectively. To address these problems, this paper proposes a new default risk prediction model based on heterogeneous ensemble learning. Three individual classifiers, extreme gradient boosting (XGBoost), a deep neural network (DNN) and logistic regression (LR), are combined with a linear weight ensemble strategy. In particular, the model is able to process missing values: after generating discrete and rank features, it adds the missing values to the model for self-training. The hyperparameters are then optimized by the XGBoost model to improve the performance of the prediction model. Finally, compared with the benchmark model, the proposed method significantly improves the accuracy of the prediction results. In conclusion, the prediction method proposed in this paper solves the class-imbalance problem.
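The linear weight ensemble strategy mentioned above can be illustrated with a minimal sketch. It assumes each of the three classifiers has already produced a default probability per loan and combines them as a normalised weighted sum. The function name, the probability values and the weights are hypothetical and are not taken from the paper, which may choose its weights differently.

```python
def linear_weight_ensemble(probas, weights):
    """Combine per-model default probabilities with a weighted sum.

    probas  : list of per-model probability lists, all of equal length
    weights : one non-negative weight per model (hypothetical values)
    """
    if len(probas) != len(weights):
        raise ValueError("expected one weight per model")
    if any(w < 0 for w in weights) or sum(weights) <= 0:
        raise ValueError("weights must be non-negative and not all zero")
    n_samples = len(probas[0])
    if any(len(p) != n_samples for p in probas):
        raise ValueError("all models must score the same loans")

    # Normalise so the combined score stays inside [0, 1].
    total = sum(weights)
    norm = [w / total for w in weights]

    # Weighted sum across models, computed loan by loan.
    return [
        sum(w * p[i] for w, p in zip(norm, probas))
        for i in range(n_samples)
    ]


# Hypothetical default probabilities for four loans from each base model.
p_xgb = [0.10, 0.80, 0.40, 0.05]
p_dnn = [0.20, 0.70, 0.50, 0.10]
p_lr = [0.30, 0.60, 0.30, 0.15]

# Hypothetical weights; in practice they would be tuned on validation data.
ensemble = linear_weight_ensemble([p_xgb, p_dnn, p_lr], [0.5, 0.3, 0.2])
```

With these illustrative inputs the combined scores are approximately 0.17, 0.73, 0.41 and 0.085. A weighted sum of this kind lets a stronger base model dominate while the other models still correct some of its errors.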
Funding
This work was funded by the National Natural Science Foundation of China under Grant Nos. 91846107 and 71571058 and by the Anhui Provincial Science and Technology Major Project under Grant Nos. 16030801121 and 17030801001.
Cite this article
Li, W., Ding, S., Wang, H. et al. Heterogeneous ensemble learning with feature engineering for default prediction in peer-to-peer lending in China. World Wide Web 23, 23–45 (2020). https://doi.org/10.1007/s11280-019-00676-y
DOI: https://doi.org/10.1007/s11280-019-00676-y