Elsevier

Electronic Commerce Research and Applications

Volume 31, September–October 2018, Pages 24-39

Study on a prediction of P2P network loan default based on the machine learning LightGBM and XGboost algorithms according to different high dimensional data cleaning

https://doi.org/10.1016/j.elerap.2018.08.002

Highlights

  • This study uses recent machine learning algorithms to predict default risk.

  • Two different methods are used to clean data with many variables and missing values.

  • Comparisons are made between these two equally sophisticated algorithms.

  • Relevant policy recommendations are put forward for global P2P platforms.

Abstract

Big data and the Internet financial sector have developed tremendously in the 21st century, and national attention to this field has gradually increased. Peer-to-peer (P2P) lending is an innovative mode of borrowing that is a powerful complement to the traditional financial industry. Predicting the default rate on credit is a prerequisite for the proper operation of related financial projects and platforms. In this paper, we use ‘multi-observation’ and ‘multi-dimensional’ data cleaning methods and apply two modern machine learning algorithms, LightGBM (released in Asia at the end of 2016) and XGBoost, to real P2P transaction data from Lending Club. We predict the default risk of loans on the platform and compare the results of the different methods. We observe that the LightGBM algorithm based on the multi-observation data set gives the best classification prediction results. The average performance rate of the historical transaction data of the Lending Club platform rose by 1.28 percentage points, which reduced loan defaults by approximately $117 million. Finally, based on the factors that influence the default rate, we suggest developments for Lending Club and other P2P platforms, as well as a direction for other countries’ development in this field.

Introduction

Internet finance has developed rapidly thanks to improvements in internet technology and the arrival of big data. It has attracted great attention from world-class institutions, organizations, governments and financial sectors in various countries, and the integration of the Internet and the financial sector has been pursued as an important practical subject. The World Bank has discussed and advised on the development of Internet finance several times, and China has unremittingly promoted policies for its sound and fast development. The concept of “inclusive finance” was written into the resolution of the third plenary session of the 18th CPC Central Committee on November 12, 2013. On June 5, 2017, a national committee of technical experts on Internet financial security officially launched the ‘sunshine project of national internet finance’. Credit rating work is key to ensuring the healthy development of Internet finance.

P2P lending is a typical representative of internet finance, with ‘inclusion’ as its main concept. It is more convenient, faster and more transparent than traditional methods of finance. It connects investors and borrowers directly and ensures that both sides benefit. It also results in lower borrowing costs, more convenient channels and, for investors, higher returns than the traditional fixed rate. Some existing studies have shown that a comprehensive credit system based on personal information, such as personal reputation, credit score and other factors, can identify high-quality borrowers. However, the rapid development of the P2P industry has also been accompanied by numerous problems that hinder its development. In 2016, Shuozheng Xu stated that P2P has a role but also a mixed reputation: some people praise it as a ‘financial innovation’, while others call it ‘illegal fund-raising’ and a ‘Ponzi scheme’. In the development of the P2P mortgage industry in China, a large number of cases (including platform defaults and foreclosures) have led to huge investor losses. The default rate in China is much higher than that in foreign countries. Risk management control is therefore the key to reducing the default rate and is at the core of the development of the P2P industry.

The amount of data is growing explosively in today’s world, and the P2P industry is no exception. Machine learning is one of the most effective methods in data mining research and is particularly important because it lets analysts extract more information from big data. Applying machine learning methods to risk management control in the P2P industry is therefore the general trend. This paper provides material for national macroeconomic regulation and policy guidance in the development of the P2P industry by analyzing the P2P platform Lending Club, which is the biggest in the USA and is well organized. Two algorithms, LightGBM and XGBoost, are applied in this paper’s prediction model for the P2P industry, with good results. These methods provide a good basis for default forecasting and credit rating. There are three main reasons for choosing the LightGBM and XGBoost algorithms as research tools in this paper. First, P2P platforms review many items for each project and borrower, and with so many variables it is difficult to predict the risk of default accurately by manual processing. As machine learning algorithms, LightGBM and XGBoost can forecast defaults by automatic iteration without manual intervention, which has theoretical and practical significance now that default prediction in the P2P industry is gradually being automated. Second, LightGBM and XGBoost are among the most advanced machine learning methods developed in recent years; LightGBM was first released only at the end of 2016. They improve substantially on previous machine learning algorithms in prediction accuracy and computational efficiency, and they reduce the possibility of over-fitting. Both algorithms have received favourable comments from scholars in many high-dimensional data analysis and forecasting projects. Third, the data used in this paper to study the P2P default rate are massive and high-dimensional, and it is difficult to achieve the expected research results with general methods. LightGBM and XGBoost, which are good at dealing with high-dimensional data, are therefore well suited to the research in this paper. However, because the methods are relatively novel, they have not yet been widely applied, and articles on them are rare. This paper therefore also extends the application scope of the two algorithms.

Literature review

A large number of studies in economics, management, sociology and information technology have examined P2P lending since Zopa was founded in Britain and Prosper opened its transaction data. At present, academic research on P2P network lending platforms is growing.

The related basic theory – GBDT

Ensemble learning is a branch of machine learning. It integrates a variety of learners (base classifiers) into a new learner based on a specific learning algorithm, so that the combined learner performs better than a single learner. Boosting is a kind of ensemble learning. It is a classification technique that strengthens weak classifiers into a strong classifier through training, in order to achieve accurate classification. The so-called weak classifier is the sub-model
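The boosting idea described above can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes squared-error loss, a single numeric feature, and one-split decision stumps as the weak learners, whereas LightGBM and XGBoost use full trees, other loss functions and many optimizations. All function names and the toy data are hypothetical.

```python
from typing import List, Tuple

# A stump is (threshold, value if x <= threshold, value if x > threshold).
Stump = Tuple[float, float, float]


def predict_stump(stump: Stump, xi: float) -> float:
    threshold, left_val, right_val = stump
    return left_val if xi <= threshold else right_val


def fit_stump(x: List[float], targets: List[float]) -> Stump:
    """Weak learner: the one-split stump with the lowest squared error."""
    best_stump: Stump = (x[0], 0.0, 0.0)
    best_err = float("inf")
    for threshold in sorted(set(x)):
        left = [t for xi, t in zip(x, targets) if xi <= threshold]
        right = [t for xi, t in zip(x, targets) if xi > threshold]
        left_val = sum(left) / len(left) if left else 0.0
        right_val = sum(right) / len(right) if right else 0.0
        candidate = (threshold, left_val, right_val)
        err = sum((t - predict_stump(candidate, xi)) ** 2
                  for xi, t in zip(x, targets))
        if err < best_err:
            best_err, best_stump = err, candidate
    return best_stump


def boost(x: List[float], y: List[float], rounds: int,
          learning_rate: float = 0.5) -> Tuple[float, List[Stump]]:
    """Each round fits a stump to the residuals of the current ensemble."""
    base = sum(y) / len(y)  # initial prediction: the mean label
    preds = [base] * len(y)
    stumps: List[Stump] = []
    for _ in range(rounds):
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        preds = [pi + learning_rate * predict_stump(stump, xi)
                 for xi, pi in zip(x, preds)]
    return base, stumps


def predict(base: float, stumps: List[Stump], xi: float,
            learning_rate: float = 0.5) -> float:
    return base + learning_rate * sum(predict_stump(s, xi) for s in stumps)


def mse(x: List[float], y: List[float], base: float,
        stumps: List[Stump], learning_rate: float = 0.5) -> float:
    return sum((yi - predict(base, stumps, xi, learning_rate)) ** 2
               for xi, yi in zip(x, y)) / len(y)


# Toy data (illustrative only): one feature, label 1.0 means default.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0]
```

Calling `boost(x, y, rounds=20)` and comparing `mse` against the result for `rounds=1` shows the training error shrinking as more weak learners are added, which is the sense in which boosting turns weak models into a stronger one.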

Empirical analysis

This chapter analyses Lending Club's completed loan data and uses the LightGBM and XGBoost algorithms to predict whether a borrower will default in the future. This should reduce the default rates of the P2P platform and help it better assess borrowers’ public information.
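This excerpt does not define the ‘multi-observation’ and ‘multi-dimensional’ cleaning methods. The sketch below shows one plausible reading, stated here as an assumption: the multi-observation approach drops sparse variables first so that more loans survive, while the multi-dimensional approach keeps every variable and drops any loan with a missing value. The column names, the 30% threshold and the toy rows are illustrative and are not taken from the paper.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[float]]


def missing_share(rows: List[Row], column: str) -> float:
    """Fraction of rows in which the given column is missing (None)."""
    return sum(1 for r in rows if r[column] is None) / len(rows)


def drop_incomplete_rows(rows: List[Row]) -> List[Row]:
    return [r for r in rows if all(v is not None for v in r.values())]


def clean_multi_observation(rows: List[Row],
                            max_missing_share: float = 0.3) -> List[Row]:
    """Assumed reading: drop sparse columns first, then incomplete rows."""
    kept = [c for c in rows[0] if missing_share(rows, c) <= max_missing_share]
    reduced = [{c: r[c] for c in kept} for r in rows]
    return drop_incomplete_rows(reduced)


def clean_multi_dimensional(rows: List[Row]) -> List[Row]:
    """Assumed reading: keep every column, drop rows with any gap."""
    return drop_incomplete_rows(rows)


# Toy loan records (illustrative values and hypothetical column names).
loans: List[Row] = [
    {"loan_amnt": 10000.0, "annual_inc": 50000.0, "months_since_delinq": None},
    {"loan_amnt": 5000.0, "annual_inc": None, "months_since_delinq": None},
    {"loan_amnt": 20000.0, "annual_inc": 80000.0, "months_since_delinq": 12.0},
    {"loan_amnt": 15000.0, "annual_inc": 60000.0, "months_since_delinq": None},
    {"loan_amnt": 8000.0, "annual_inc": 40000.0, "months_since_delinq": 30.0},
]

many_rows = clean_multi_observation(loans)      # more loans, fewer columns
many_columns = clean_multi_dimensional(loans)   # fewer loans, all columns
```

Under this reading the two cleaned data sets trade rows against columns, which is why the same algorithm can perform differently on each of them.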

Main conclusion

This article applies the LightGBM and XGboost algorithms to a P2P network credit default prediction model. The results are summarized according to the output of each model. The conclusions are as follows.

  1. The LightGBM algorithm is better than the XGBoost algorithm, and the ‘multi-observation’ data cleaning method is better than the ‘multi-dimensional’ method. In terms of the same algorithm, multi-observation data sets are better than multi-dimensional data sets. In terms of the same data

This work was supported by the National Social Science Foundation of China [Project Nos. 17BTJ020 and 13CTJ005], the National Natural Science Foundation of China [Project Nos. 71772113 and 71272010], Major Projects of the National Social Science Foundation of China [Project No. 13&ZD171], the National Social Science Foundation of Liaoning Province [Project Nos. L16BTJ001 and L17BTJ003], and the Youth Project for Humanities and Social Science Research, Ministry of Education of China [Project No. 18YJC910013].
