
Heterogeneous ensemble learning with feature engineering for default prediction in peer-to-peer lending in China

Abstract

In recent years, peer-to-peer (P2P) lending in China, a new form of unsecured financing conducted over the Internet, has boomed, and the accompanying credit risk problems are inevitable. A key challenge facing P2P lending platforms is accurately predicting the default probability of the borrower of each loan, which helps the platform avoid credit risk. Traditional default prediction models based on machine learning and statistical learning do not meet the needs of P2P lending platforms because, in data-driven P2P lending, credit data contain a large number of missing values, are high-dimensional and are class-imbalanced, all of which make it difficult to train an effective default risk prediction model. To solve these problems, this paper proposes a new default risk prediction model based on heterogeneous ensemble learning. Three individual classifiers, extreme gradient boosting (XGBoost), a deep neural network (DNN) and logistic regression (LR), are combined through a linear weighted ensemble strategy. In particular, the model is able to handle missing values: after discrete and rank features are generated, the missing values are fed into the model for self-training. The hyperparameters of the XGBoost model are then optimized to further improve prediction performance. Compared with benchmark models, the proposed method significantly improves prediction accuracy and addresses the class-imbalance problem.
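To make the abstract's pipeline concrete, the sketch below shows a linear weighted combination of XGBoost, a neural network and logistic regression over engineered discrete and rank features. It is a minimal illustration assuming scikit-learn and xgboost are available; the MLPClassifier standing in for the paper's DNN, the binning and ranking helpers, the fixed ensemble weights and all hyperparameters are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal sketch: heterogeneous ensemble with linear weights (illustrative only).
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from xgboost import XGBClassifier


def add_engineered_features(df: pd.DataFrame, cols) -> pd.DataFrame:
    """Add discrete (quantile-binned) and rank features for the given numeric columns."""
    out = df.copy()
    for c in cols:
        out[f"{c}_bin"] = pd.qcut(out[c], q=10, labels=False, duplicates="drop")
        out[f"{c}_rank"] = out[c].rank(method="average")
    return out


def fit_weighted_ensemble(X_train, y_train, X_test, weights=(0.5, 0.3, 0.2)):
    """Combine XGBoost, a small neural network and logistic regression
    with fixed linear weights on their predicted default probabilities."""
    # XGBoost consumes NaNs natively, so raw features go in unchanged.
    xgb = XGBClassifier(n_estimators=300, max_depth=4, learning_rate=0.05)
    xgb.fit(X_train, y_train)

    # The other two base learners need imputation and scaling first.
    imputer, scaler = SimpleImputer(strategy="median"), StandardScaler()
    X_tr = scaler.fit_transform(imputer.fit_transform(X_train))
    X_te = scaler.transform(imputer.transform(X_test))

    dnn = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500)
    dnn.fit(X_tr, y_train)

    lr = LogisticRegression(max_iter=1000)
    lr.fit(X_tr, y_train)

    # Stack the three default-probability columns and apply the linear weights.
    probs = np.column_stack([
        xgb.predict_proba(X_test)[:, 1],
        dnn.predict_proba(X_te)[:, 1],
        lr.predict_proba(X_te)[:, 1],
    ])
    return probs @ np.array(weights)
```

In practice the ensemble weights would be learned or tuned on a validation set (for example, to maximize AUC) rather than fixed in advance.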

Funding

This work was funded by the National Natural Science Foundation of China under Grant Nos. 91846107 and 71571058, and by the Anhui Provincial Science and Technology Major Project under Grant Nos. 16030801121 and 17030801001.

Author information

Correspondence to Shuai Ding or Shanlin Yang.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Li, W., Ding, S., Wang, H. et al. Heterogeneous ensemble learning with feature engineering for default prediction in peer-to-peer lending in China. World Wide Web 23, 23–45 (2020). https://doi.org/10.1007/s11280-019-00676-y
