1 Introduction

As the economy develops, the social credit system becomes increasingly important. In recent years in particular, personal credit has had a broad impact, and economic development and personal credit have become ever more closely linked. Personal loans, which depend on personal credit, are also quietly changing. The traditional personal credit system assesses credit mainly on the basis of financial transaction records. Because of the data island effect, however, a comprehensive financial transaction record is not easy to obtain, and for the emerging P2P industry, gaining access to such data is a major challenge. Traditional personal loans rely on manual assessment of the borrower. This type of loan has many disadvantages, such as cumbersome procedures and the requirement that the borrower provide comprehensive personal information. P2P loans make up for these shortcomings of traditional personal loans. However, because they lack traditional transaction data for assessment, P2P loans that depend on personal credit face a great challenge.

In view of these problems in personal credit evaluation, we carried out related research. We found that social data can compensate for missing or incomplete financial information. In addition, social data is easy to obtain, inexpensive, and updated in a timely manner. Among current social data sources, microblogging is the most popular social networking platform, and its user information meets our research needs. We therefore use microblogging data to explore features related to individual credit, and we model the data with mature machine learning algorithms such as Support Vector Machine (SVM), Naive Bayes (NB), Logistic Regression (LR), and Gradient Boosting (GB). Our study shows that social data, represented by microblogging data, yields satisfactory results.

2 Related Work

Personal credit was initially assessed on the basis of social experience. Later, numerical credit evaluation systems appeared that applied multiple discriminant analysis and multiple regression analysis to retail credit application data [1]. Statistical methods, artificial intelligence, and other techniques have since been widely used in credit assessment. Based on real credit card data, Myers and Forgy show that the SVM algorithm has outstanding ability in credit evaluation [2]. Xiao and Fei use grid search with 5-fold cross-validation to find the optimal parameter values of various SVM kernel functions [3]. Yu and Yao propose a weighted least squares support vector machine (LSSVM) classifier with design of experiment (DOE) for parameter selection in credit risk evaluation [4]. Although the SVM algorithm has outstanding credit evaluation ability, it does not perform well when evaluating a large amount of default data. Jiang and Xie select the representative logistic regression and radial basis function neural network methods and establish two single evaluation models [5]. Bekhet and Eletter propose two credit scoring models using data mining techniques to support loan decisions for Jordanian commercial banks. Their results indicate that the logistic regression model performed slightly better than the radial basis function model in terms of overall accuracy, whereas the radial basis function model was superior in identifying customers who may default [6]. All of these approaches share a common drawback: the credit rating agency needs a comprehensive financial transaction record of the person being assessed, and such records are often difficult to obtain.
Social media is currently developing very rapidly, and a large amount of social data has emerged. Social media data has been used for sentiment analysis [7, 8], user location analysis [9], user relationship analysis [10], and so on. In addition, social data beyond traditional financial data has been applied to personal credit assessment. Wilhelm introduced ZestFinance's use of social credit data [11]. Li and Liu showed that Alipay, as a third-party payment credit reference platform, can reduce credit risk, increase the scale of credit, extend the credit period, increase social welfare, and promote the development of online credit [12]. A large amount of personal credit information is hidden in users' social media data. We therefore explore the use of microblog data for personal credit evaluation and analyze the ability of social data to predict personal credit.

3 Feature Description

3.1 Research on Traditional Personal Credit

Based on a study of the previous personal credit literature, we summarize three basic principles: ability, willingness, and stability. Table 1 shows the three principles.

Table 1. Three principles of the traditional personal credit literature and their explanations.

Principle 1: Ability.

Refers to the user's ability to repay. This principle concerns whether the user has the material basis to honor commitments and repay the loan on time. Strong “ability” is usually a sign of good personal credit, while weak “ability” indicates poor personal credit and higher risk. Many people are unable to fulfill their promises because their “ability” is weak.

Principle 2: Willingness.

Refers to the user's willingness to repay. This principle refers to credit behavior driven by the user's inner motivation. A user with strong “willingness” will do everything possible, fully mobilizing personal initiative, to honor commitments and maintain good credit, even when “ability” is weak. Conversely, a user with poor “willingness” will tend to delay repayment and thereby harm his or her own credit, even when “ability” is strong.

Principle 3: Stability.

This principle describes how stably the user maintains good credit. Factors such as capital flow, major illness, changes in assets, and job changes alter the user's circumstances, and the user's personal credit changes accordingly.

3.2 Research on Microblogging Data

Based on these principles, we studied the acquired microblogging data and carried out feature extraction. We found that the data can be roughly divided into the following three components (as shown in Table 2):

Table 2. The three major components of the microblogging research data.

Part 1: Attributes of Demographic.

This part of the data comes mainly from the user's personal profile, whose fields are defined by the microblogging operator. Users can fill in this information at any time. It includes user login name, user nickname, gender, date of birth, profile, real-name authentication, location, educational information, microblogging level, membership, email, and registration time.

Part 2: Tweets Content.

Microblogging is a platform for sharing and exchange. Most users publish tweets on the platform that are highly timely and spontaneous. Tweet content expresses the user's thoughts and updates at that moment. Users can upload text, pictures, and videos, most of which are unstructured data. Figure 1 shows the data of a text tweet, where the id has been anonymized.

Fig. 1. A user's text tweet.

Part 3: User Relationship Structure.

Weibo is a platform for sharing, disseminating, and obtaining information based on user relationships. We study the relationships between fans and followers [10] as well as ego-network structures.

4 Data Analysis

We obtained microblog data from nearly 300 thousand users and used them as test subjects. To protect user privacy, we anonymized all data, and the use of these user data for our research has been authorized. The credit labels of these test users were assigned by third-party organizations. There are two labels, “good credit” and “bad credit”. Based on the three principles of traditional personal credit investigation, we extracted features from our research data in Sect. 3. However, we found that not all microblog data helps to distinguish personal credit, and some even impairs the judgment of credit quality. We therefore process the microblog data further and select appropriate attributes as input data for our model.

4.1 Attributes of Demographic

For the demographic attributes described in Sect. 3.2 (user login name, user nickname, gender, date of birth, profile, real-name authentication, location, educational information, microblogging level, membership, email, and registration time), we computed the Pearson correlation coefficient and the χ2 statistic with respect to the credit label. The results are shown in Table 3. We find that the “profile” attribute is unstructured data with a very high proportion of missing values. For identifying the target label, the “profile” attribute also overlaps with “location” and “educational information”, so we do not analyze it here.

Table 3. Pearson correlation coefficient and χ2 statistics of the demographic attributes.

Table 3 shows that “user login name”, “user nickname”, and “email” carry no discriminative information for the credit label. The weak result for “educational information” is probably due to sparse data: 92.17% of its values are missing. The “date of birth” attribute also has a large share of missing values (70.83%), and its authenticity and accuracy remain to be verified. Nevertheless, “date of birth” identifies the age group to which a user belongs, and each age group has a different understanding of social responsibility, so the attribute still shows a strong linear dependence with the credit label. Based on this analysis, we use “gender”, “date of birth”, “real-name authentication”, “location”, “microblogging level”, “membership”, and “registration time” as input attributes of our model. We think these attributes correspond to the principle of ability in traditional credit investigation.
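The two statistics used in this subsection can be illustrated with a minimal sketch. It assumes a numerically encoded attribute for the Pearson coefficient and a categorical attribute with a binary credit label for the χ2 statistic; the function names are illustrative and are not taken from the original study.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0  # a constant column carries no linear information
    return cov / (sx * sy)

def chi_square(categories, labels):
    """Chi-square statistic of independence between a categorical attribute and a label."""
    n = len(labels)
    row_tot, col_tot, observed = {}, {}, {}
    for c, y in zip(categories, labels):
        row_tot[c] = row_tot.get(c, 0) + 1
        col_tot[y] = col_tot.get(y, 0) + 1
        observed[(c, y)] = observed.get((c, y), 0) + 1
    stat = 0.0
    for c in row_tot:
        for y in col_tot:
            # Expected count under independence of attribute and label.
            expected = row_tot[c] * col_tot[y] / n
            stat += (observed.get((c, y), 0) - expected) ** 2 / expected
    return stat
```

A larger χ2 value or a larger absolute Pearson coefficient suggests that the attribute is more informative about the credit label; attributes with many missing values would need to be imputed or filtered before either statistic is computed.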

4.2 Tweets Content

Tweet content is the channel through which users express their life, emotions, and opinions in real time on the microblogging platform. A user's published tweets may seem to have no direct relationship with credit labels, but our study found that tweet content reflects the user portrait well. Since published tweets are mostly unstructured data, as shown in Fig. 1, we applied natural language processing to the 7,297,649 tweets we acquired. First, we extracted keywords for each credit label; the results are shown in Table 4.

Table 4. Keyword extraction of tweets content for users of “Good credit” and “Bad credit”.

Among users labeled as good credit, we found the keywords “wife” and “kids”, which indicates that these users are more concerned with their families. Users with a strong sense of responsibility for the family are likely to maintain a good credit record. Words such as “dog”, “cat”, “flower”, “car”, and “future” also appear as keywords of good credit. Our interpretation is that a person who pays attention to everyday life is full of hope for it; loving life and looking forward to the future are signs of a positive, enterprising attitude. These keywords are consistent with “willingness”, one of the three principles described in Sect. 3.1. It is this positive attitude towards life that leads the user to keep his or her promises.

On the other hand, among users labeled as bad credit, we found keywords such as “win”, “game”, “gaming”, “mahjong”, and “play”. It is hard for people who are addicted to games to have a positive attitude towards real life. These pastimes not only cost a lot of money, but may also leave users unable to honor their commitments in a timely manner. Keywords such as “XiaoMi”, “iphone”, and “red packet” show that these users pay more attention to forwarding promotional posts, most of which come from businesses; these users care more about picking up small bargains. This is the negative side of the principle of “willingness”.
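The text does not state which keyword extraction method was used. As one possible illustration, the sketch below ranks words by a smoothed log-ratio of their relative frequency in one credit class against the other. It assumes that tweets have already been segmented into tokens; the function name `class_keywords` and the scoring rule are assumptions, not the method of the study.

```python
import math
from collections import Counter

def class_keywords(docs, labels, target, top_k=5):
    """Rank words by smoothed log-ratio of frequency in `target` docs vs. all other docs."""
    in_counts, out_counts = Counter(), Counter()
    for tokens, y in zip(docs, labels):
        (in_counts if y == target else out_counts).update(tokens)
    vocab = set(in_counts) | set(out_counts)
    # Add-one smoothing so that unseen words do not produce log(0).
    n_in = sum(in_counts.values()) + len(vocab)
    n_out = sum(out_counts.values()) + len(vocab)
    score = {
        w: math.log((in_counts[w] + 1) / n_in) - math.log((out_counts[w] + 1) / n_out)
        for w in vocab
    }
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(vocab, key=lambda w: (-score[w], w))[:top_k]
```

Words that are frequent in both classes receive a score near zero, so only words that are characteristic of one class rise to the top of the ranking.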

In addition to the keywords extracted from tweet content, we also found the following important attributes:

“Original Tweets”: Refers to the Good Credit Users.

These users prefer to show their lives on the microblogging platform and to express their feelings about life, their vision of the future, and their positive emotions. Such real-time portrayals of the user's real life appear on the microblogging platform as original tweets. In contrast, bad credit users prefer to forward other people's tweets, and even post the same content repeatedly, which not only tires other users but also annoys them. This is irresponsible behavior.

“Tweets Release Time”: People’s Life is Regular.

A regular schedule is the basis of normal life and work. Related research has found that staying up late over the long term and living irregularly lead to chronic diseases. The time at which tweets are published reflects how regular a user's schedule is. We found that users with poor credit often publish tweets between 1:00 and 5:00. An irregular life is likely to lead to bad behavior, and it is a negative example of the “stability” principle among the three principles in Sect. 3.1.

“Work and Education Experience”: Refers to the User’s Work and Learning Experience.

The educational information attribute was discussed in Sect. 4.1. In the demographic data, 92.17% of its values are missing, but users often unintentionally mention their education in their tweets, which largely makes up for this gap. We regard a good educational background and a higher level of education as signs of good credit. Similarly, users often refer to their work in tweets, and experienced employees are better able to honor agreements. We therefore use “work and education experience” as a feature.

Table 5 summarizes the information extracted from the user tweet in Fig. 1. The tweet shown in Fig. 1 is a forwarded tweet that does not contain the user's own view, so under our definition it is not an original tweet. The publication time is clearly shown in Fig. 1 as Wed Oct 17 10:00:12. The tweet does not reveal the user's work or educational information; we complete the user's work and education experience from the user's demographic attributes or other tweets. Users may also disclose their location in tweet content: Fig. 1 shows the user's location as Chaoyang District, Beijing. In addition, we extract the keywords of the tweet: XiaoMi, iPhone, Beijing, lottery, booking.

Table 5. Attributes extracted from the tweet shown in Fig. 1.
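The “original tweets” and “tweets release time” attributes of this subsection could be computed along the following lines. This is a minimal sketch: the record fields `is_forward` and `created_at` are hypothetical names, and the interval 1:00 to 5:00 is taken to mean hours 1 through 4 inclusive.

```python
from datetime import datetime

def tweet_behavior_features(tweets):
    """Return (share of original tweets, share of tweets posted between 1:00 and 5:00)."""
    if not tweets:
        return 0.0, 0.0
    # A tweet counts as original when it is not a forward of someone else's tweet.
    original = sum(1 for t in tweets if not t["is_forward"])
    # Night-time posting: hour of day in [1, 5).
    night = sum(1 for t in tweets if 1 <= t["created_at"].hour < 5)
    return original / len(tweets), night / len(tweets)

# Toy records with made-up timestamps, for illustration only.
example = [
    {"is_forward": True, "created_at": datetime(2020, 1, 1, 10, 0, 12)},
    {"is_forward": False, "created_at": datetime(2020, 1, 2, 2, 30, 0)},
]
original_share, night_share = tweet_behavior_features(example)
```

Both values lie in [0, 1], so they can be used as model inputs without further rescaling.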

4.3 User Relationship Structure

As mentioned earlier, microblogging is a platform for sharing, disseminating, and obtaining information based on user relationships. A good relationship structure is a good indicator of user credit. Although we have access to a large amount of data, relevant laws and regulations and the complexity of the microblog user relationship network limit us to extracting the one-hop network structure. We therefore take a given user as the center and use the PageRank algorithm to establish the network structure of that user.

$$ PR\left( p_{i} \right) = \alpha \sum\limits_{p_{j} \in M_{p_{i}}} \frac{PR\left( p_{j} \right)}{L\left( p_{j} \right)} + \frac{1 - \alpha}{N} $$
(1)

where \( p_{i} \) is the i-th user, \( M_{p_{i}} \) is the set of users that link to \( p_{i} \), \( L(p_{j}) \) is the number of outgoing links of user \( p_{j} \), and N is the total number of users in the network. Here, we set \( \alpha \) to 0.85.
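Eq. (1) can be evaluated by power iteration. The sketch below assumes the standard PageRank reading of the symbols, with \( M_{p_{i}} \) as the set of users linking to \( p_{i} \) and \( L(p_{j}) \) as the out-degree of \( p_{j} \). The handling of users without outgoing links and the toy ego network are assumptions made for illustration.

```python
def pagerank(out_links, alpha=0.85, iters=100):
    """Power iteration for Eq. (1). `out_links` maps each user to the users they link to."""
    nodes = sorted(set(out_links) | {v for vs in out_links.values() for v in vs})
    n = len(nodes)
    pr = {u: 1.0 / n for u in nodes}
    for _ in range(iters):
        # Every user starts each round with the teleport term (1 - alpha) / N.
        new = {u: (1.0 - alpha) / n for u in nodes}
        for u in nodes:
            targets = out_links.get(u, [])
            if targets:
                share = alpha * pr[u] / len(targets)
                for v in targets:
                    new[v] += share
            else:
                # Dangling user with no out-links: spread its score uniformly (an assumption).
                for v in nodes:
                    new[v] += alpha * pr[u] / n
        pr = new
    return pr

# Toy one-hop network centered on a hypothetical user "ego".
scores = pagerank({"ego": ["a", "b"], "a": ["ego"], "b": ["ego"], "c": ["ego"]})
```

In this toy network the central user receives links from all three neighbors and therefore obtains the highest score, while a user with no incoming links keeps only the teleport term.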

A good network of relationships shows that the user is popular and also reflects the user's credit well. We extract the following attributes from the user's network: the number of fans, the number of followers, the number of followees (accounts the user follows), the ratio of fans to followees, the ratio of mutual follows to fans, and the ratio of mutual follows to followees.

5 Experiment Setting

Based on the above feature extraction, we collect all the useful features, normalize the data to [0, 1], and list them in detail in Table 6. We use these features as input to the final model.

Table 6. All valid attribute features of the micro-blog data.
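The text does not specify how the features are mapped to [0, 1]. A common choice is min-max scaling, sketched below for a single numeric column; the treatment of constant columns is an assumption.

```python
def min_max_scale(column):
    """Rescale a numeric column linearly to the interval [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # Constant column: no spread to rescale, so map every value to 0.
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]
```

In a cross-validated experiment, the minimum and maximum would normally be taken from the training folds only and then applied to the held-out fold, so that no information from the test data leaks into the features.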

Using these attributes, we sampled 5000 users from all users, including 900 users with good credit and 4100 users with bad credit. This data set is imbalanced, which complicates the research, but such an imbalanced data set is closer to the actual user population. We used four commonly used models: Support Vector Machine (SVM), Naive Bayes (NB), Logistic Regression (LR), and Adaboost.

Of the four algorithms, Adaboost shows good results. The Adaboost algorithm is described in Fig. 2. A base learner is first trained on the initial training set; we use a common decision tree algorithm as the base learner. The distribution of the training samples is then adjusted according to the performance of this base learner, so that the samples it misclassified receive more attention in subsequent training. The next base learner is trained on the adjusted sample distribution. This is repeated until the number of base learners reaches the specified value T. Finally, the T base learners are combined by weighted voting.

Fig. 2. Adaboost algorithm.
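The reweighting loop described above can be sketched in a few lines. As a simplification, the base learner here is a one-level decision tree (a decision stump) rather than a full decision tree, and labels are encoded as +1 and -1; all function names are illustrative and not taken from the study.

```python
import math

def fit_stump(X, y, w):
    """Find the threshold rule on a single feature with the lowest weighted error."""
    best = None
    for j in range(len(X[0])):
        for thr in sorted({row[j] for row in X}):
            for sign in (1, -1):
                pred = [sign if row[j] <= thr else -sign for row in X]
                err = sum(wi for wi, p, yi in zip(w, pred, y) if p != yi)
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    return best

def stump_predict(stump, row):
    _, j, thr, sign = stump
    return sign if row[j] <= thr else -sign

def adaboost_fit(X, y, T=10):
    """Train up to T weighted stumps; labels must be +1 or -1."""
    n = len(X)
    w = [1.0 / n] * n  # start from a uniform sample distribution
    ensemble = []
    for _ in range(T):
        stump = fit_stump(X, y, w)
        err = max(stump[0], 1e-10)  # avoid division by zero for a perfect stump
        if err >= 0.5:
            break  # no better than chance: stop adding learners
        a = 0.5 * math.log((1.0 - err) / err)
        ensemble.append((a, stump))
        # Raise the weight of misclassified samples, lower the rest, then renormalize.
        w = [wi * math.exp(-a * yi * stump_predict(stump, row))
             for wi, yi, row in zip(w, y, X)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def adaboost_predict(ensemble, row):
    """Weighted vote of all base learners."""
    score = sum(a * stump_predict(s, row) for a, s in ensemble)
    return 1 if score >= 0 else -1
```

No single stump can separate a one-dimensional data set whose labels alternate in blocks, but the weighted vote of a few stumps can, which is the effect the reweighting step is designed to achieve.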

Adaboost’s AUC value is only 0.564 under a balanced setting, but in financial credit reporting even a small improvement yields large gains. Moreover, social data is timely and easy to access, which gives it a strong advantage over traditional financial data.

We perform 5 times 5-fold cross-validation to compare the performance of the four models SVM, NB, LR, and Adaboost. The results are shown in Table 7.

Table 7. Classification accuracy of the SVM, NB, LR, and Adaboost models under 5 times 5-fold cross-validation (unit: %).
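The index splits for 5 times 5-fold cross-validation can be generated as in the following minimal sketch. The shuffling seed is an arbitrary assumption, and the splits are not stratified by credit label, which may matter for a data set as imbalanced as this one.

```python
import random

def repeated_kfold(n_samples, k=5, repeats=5, seed=0):
    """Yield (train_idx, test_idx) pairs for `repeats` reshuffled k-fold splits."""
    rng = random.Random(seed)
    for _ in range(repeats):
        idx = list(range(n_samples))
        rng.shuffle(idx)  # a fresh permutation for every repetition
        folds = [idx[i::k] for i in range(k)]
        for i in range(k):
            test = folds[i]
            train = [j for f in folds[:i] + folds[i + 1:] for j in f]
            yield train, test
```

With 5000 users this yields 25 train/test pairs of 4000 and 1000 users each; a model's reported accuracy is then the mean over the 25 test folds.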

The results of the K-fold cross-validation show that the Adaboost model has higher accuracy than the other three algorithms. Both LR and SVM performed poorly, Adaboost was significantly better than the other models in AUC, and NB was slightly worse. Our data contains many sparse and missing values, which is a weakness of the LR and SVM models. Adaboost, however, still performs strongly when the data quality is poor.
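For reference, AUC can be computed directly from its pairwise definition: the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The sketch below assumes labels encoded as 1 (positive) and 0 (negative) and uses a quadratic-time loop, which is adequate for illustration but not for large data sets.

```python
def auc(scores, labels):
    """AUC as the probability that a random positive outranks a random negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))
```

Unlike accuracy, this measure does not depend on a decision threshold or on the class proportions, which makes it a useful complement when one class is much larger than the other.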

6 Conclusion

With the advent of the “Data Age”, we have the opportunity to capture a wide variety of data. Seemingly unrelated data can sometimes be very closely related. Through in-depth research on social data, we find the connection between these heterogeneous data and personal credit, and prove their connection through experiments.

Through the study of social data, this paper analyzes the attributes in social data that are relevant to personal credit evaluation and extracts them. A decision tree algorithm is trained on these attributes to form a base learner, and the weights of the training samples are then adjusted according to the performance of the base learner. Another base learner is trained on the reweighted data, and the process is repeated until the required number of base learners is reached. Using the Adaboost model, these base learners are combined by weighted voting. We also compare the Adaboost model with three other common models, highlighting the advantages of the Adaboost model.

In the future, we want to examine other types of social data and to combine social data with financial data to obtain more comprehensive data for further study of personal credit. With more comprehensive personal credit information, we hope to extend the binary classification problem to a multi-class problem and to develop graded credit evaluation results for different levels of users.