
1 Introduction

Big data is the foundation of machine learning, and many financial companies, including most banks, value the power of data. In 2015, the global card fraud rate was 7.76 BP (the amount of fraud per 10,000 dollars of transactions). The UK anti-fraud agency CIFAS reported that it and its affiliates had prevented a total loss of about 1 billion pounds by using 325,000 records and related algorithms. In China, many banks and financial enterprises lose billions of Yuan per year to financial fraud. These are only a few examples from financial fraud prediction, but they suggest that enormous economic losses could be prevented if forecasts of financial behaviors were deployed in more institutions.

As the figures above suggest, fraud detection is an important application of predicting financial behaviors [1]. Fraud events are detected from real, huge datasets that record different information about clients and their transactions. Researchers usually sort these behaviors [2] into different categories and then solve the corresponding problems. Common tasks in financial fraud detection include outlier detection [3, 4], clustering, and regression. With the increase in fraudulent behavior, more approaches have been used to identify fraud events. Data mining [5] is a necessary technology in fraud detection; in the early stage, data mining with PCA [6], SVM [7] and other methods made progress in this field. Meanwhile, statistical models such as Naive Bayes [8,9,10], belief stage [11] and logistic models [12] appeared in real applications. Since 2000, with the rapid development of computer technology and the explosive growth of data volume, unsupervised algorithms [13, 14] and neural networks [15, 16], including fuzzy neural networks [17], have attracted attention.

Especially in recent years, deep learning [18, 19] has been widely used for forecasting and has produced many research achievements in academia. At the same time, industry and academia have also combined [20] statistical models and deep learning to achieve better results.

To give a clearer illustration, we compare some of the main methods in Fig. 1.

Fig. 1. Algorithms summary

In this study, we are motivated by the fact that citizens' deposits in banks are usually used by financial institutions for centralized investment, so it is necessary for banks to predict these potential customers. We design an efficient algorithm for deposit prediction based on an existing real dataset (Fig. 2).

Fig. 2. Integrated network

Our contributions are as follows:

(1) Compared with other studies, the algorithm in this paper takes different data attributes into account and adopts a different encoding method for each of them.

(2) We constrain the GAN to make it more robust, especially when the distributions of positive and negative samples are approximately the same.

2 Related Work

Compared with earlier data volumes, today's datasets are huge, so people cannot label data as before, and the useful labeled positive samples make up a very small part of the whole dataset. The network therefore has to learn much useless information instead of useful features. Meanwhile, the study of Generative Adversarial Nets (GAN) [21] provides a new way to exploit the power of data; following its applications in image processing [22], we can further use it in other fields.

Data attributes strongly influence the subsequent procedure: they determine which classifier to use and indicate how features should be learned. This pushes us to think carefully about how to process the original data before passing it to the classifiers without distorting those attributes. Data encoding is a standard way to represent raw data, and one-hot encoding [23] has proved to be an efficient way to preserve data attributes in machine learning. However, relying on only one encoding method is unreliable, which led us to draw inspiration from some popular algorithms [24] in deep learning. Through observation, we find that the data can always be divided into two categories: data with no correlation between values, which we call objective data, and index data, whose numerical attributes must be considered.

Hybrid Encoding: Financial datasets contain not only objective data, such as occupations and ages, but also index data of the financial industry. Normalizing all of these data into the same format hurts the final classification, so we propose a hybrid encoding method to avoid this problem. One-hot encoding is used for the objective data. In early research, this method was often used in encodings related to FPGAs [23, 25]; because of its convenient representation and small number of bytes, logic circuit programming also prefers one-hot encoding. Recently, it has also been used in machine learning [26]. This encoding maintains data independence for the representation of discrete (objective) data. The index data are normalized into integers, because continuous (index) data are often correlated with each other. The advantage of hybrid encoding is that it takes full account of the different attributes of the data, rather than forcing them all into one representation.

Constrained GAN: GANs have been successfully applied to images [27]. An NVIDIA group [28] used a GAN to generate high-quality human face pictures that are very similar to real faces. This technology has developed quite well and is constantly applied to new applications; fake pictures made by GANs are hard for human eyes to distinguish today. GANs essentially use the KL divergence for optimization. However, we find that the distributions of positive and negative samples in our dataset are very similar, so we introduce the Mahalanobis distance (MD) [29] as a constraint in the GAN. With the help of MD, the newly generated positive samples are closer to the original data.

Integrated Classifiers: The success of AdaBoost [30] shows that a combination of weak classifiers can become a strong classifier. In fact, the idea of model fusion comes largely from industry, which often combines the different advantages of multiple models to integrate a better-performing model [31]; this often yields better experimental results. In practical applications, however, it is not so simple to apply: one still needs to select the better-performing classifiers and combine them according to voting rules, and the important thing in this process is to drop the classifiers with the worst results.

3 Designed Method

3.1 Hybrid Encoding

STEP 1: Our work first distinguishes whether the data are discrete (objective) or continuous (index). Discrete data are encoded with one-hot style binary codes. Two-bit binary numbers can represent four different combinations; since no feature in our dataset has more than nine sub-categories, 4-bit binary digits meet the requirement. Every main feature is encoded with four bits, even if it contains only a few categories. We process the data this way because it prevents differences in the number of binary digits between categories from affecting the classification result. Experiments show that this uniform number of digits ultimately improves accuracy by 1%-2% and does not cause a dimension disaster.

STEP 2: The continuous data are unified into numerical values within the range 0 to 10: the maximum occurring value is assigned 10, and the minimum is assigned 0.

STEP 3: After the data have been formed according to the above encoding methods, the one-hot codes are placed in the high positions and the continuous values in the low positions to form a single data sequence, as shown in Table 1. A minimal encoding sketch is given after the table.

Table 1. The format of hybrid encoding
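As a concrete illustration of STEP 1 to STEP 3, the following Python sketch encodes one record. The feature names, category lists and value ranges are invented for illustration and are not taken from the dataset used in this paper.

```python
# Minimal sketch of the hybrid encoding of Sect. 3.1.
# Feature names, category lists and value ranges are illustrative assumptions.

CATEGORIES = {                          # objective (discrete) features
    "occupation": ["admin", "technician", "services", "management"],
    "education":  ["primary", "secondary", "tertiary"],
}
INDEX_RANGES = {                        # index (continuous) features: (min, max)
    "balance": (0.0, 100000.0),
    "euribor": (0.5, 5.0),
}

def encode_discrete(feature, value):
    """STEP 1: encode a category index as a fixed-width 4-bit binary string."""
    idx = CATEGORIES[feature].index(value)
    return format(idx, "04b")           # 4 bits cover up to 16 sub-categories

def encode_index(feature, value):
    """STEP 2: scale a continuous value onto the integers 0..10."""
    lo, hi = INDEX_RANGES[feature]
    return round(10 * (value - lo) / (hi - lo))

def hybrid_encode(record):
    """STEP 3: binary codes in the high positions, scaled integers in the low positions."""
    high = [encode_discrete(f, record[f]) for f in CATEGORIES]
    low = [str(encode_index(f, record[f])) for f in INDEX_RANGES]
    return high + low

if __name__ == "__main__":
    sample = {"occupation": "services", "education": "tertiary",
              "balance": 2500.0, "euribor": 1.3}
    print(hybrid_encode(sample))        # e.g. ['0010', '0010', '0', '2']
```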

3.2 Constrained GAN

GANs are often used in image classification and video tracking, but in the financial-data scenario we can still turn the data into corresponding pixel values. Even if each individual value is a meaningless pixel, the resulting pictures are statistically significant. In this case, the positive samples are converted into corresponding pixel values [22] to form pictures, which are fed into the new network for training to produce new pictures; the generated pictures are then converted back into data. This process completes the enrichment of the positive samples; a conversion sketch is given below.
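The following sketch shows one way to map encoded records onto the \( 20 \times 13 \) grayscale tiles described in Sect. 4.3 and back again. The 0-255 value range and the tile size follow the paper, while the array shapes and dtype are illustrative assumptions.

```python
import numpy as np

# Sketch of converting encoded records to 20x13 "pictures" for the GAN and back.
# Tile size and 0..255 value range follow Sects. 3.2 and 4.3; the rest is assumed.

N_USERS_PER_TILE, N_FEATURES = 20, 13

def records_to_tiles(data):
    """data: (num_records, 13) array of encoded values already scaled into [0, 255]."""
    usable = (len(data) // N_USERS_PER_TILE) * N_USERS_PER_TILE
    tiles = data[:usable].reshape(-1, N_USERS_PER_TILE, N_FEATURES)
    return tiles.astype(np.uint8)              # one tile = one 20x13 picture

def tiles_to_records(tiles):
    """Inverse transform: flatten generated tiles back into data rows."""
    return tiles.reshape(-1, N_FEATURES).astype(np.float32)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    fake_rows = rng.integers(0, 256, size=(100, N_FEATURES))
    tiles = records_to_tiles(fake_rows)        # -> (5, 20, 13) pictures for the GAN
    rows = tiles_to_records(tiles)             # -> (100, 13) rows after generation
    print(tiles.shape, rows.shape)
```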

The objective of optimization can be written as:

$$\begin{aligned} \min _G \max _D V(G, D)=\mathbb {E}_{x \sim P_{data}} [\log D(x)]+\mathbb {E}_{z \sim P_{z}} [\log (1-D(Z))] \quad \end{aligned}$$
(1)

To optimize the discriminator, we train it continuously so that it assigns the largest possible probability to the real data. We can rewrite Eq. (1) as:

$$\begin{aligned} V(G, D)=P_{data}(x)logD(x)+P_z(Z)log(1-D(Z)) \quad \end{aligned}$$
(2)

Then, the optimal discriminator \(D^\prime \) is:

$$\begin{aligned} D^\prime =\frac{P_{data}(x)}{P_{data}(x)+P_z(Z)} \quad \end{aligned}$$
(3)

Substituting (3) back into (2):

$$\begin{aligned} \begin{aligned} V(G,D^\prime )=\,&\mathbb {E}_{x \sim P_{data}}\left[ \log {\frac{P_{data}(x)}{P_{data}(x)+P_z(Z)}}\right] \\&+\mathbb {E}_{z \sim P_{z}}\left[ \log {\frac{P_{z}(Z)}{P_{data}(x)+P_z(Z)}}\right] \quad \end{aligned} \end{aligned}$$
(4)

The Kullback-Leibler (K-L) divergence [32] is used to measure the similarity between two probability distributions; it can be defined as:

$$\begin{aligned} D_{KL}(P \Vert Q)=\sum _{i=1}^N P(x_i)log\frac{P(x_i)}{Q(x_i)} \quad \end{aligned}$$
(5)

With the K-L divergence, we can rewrite (4) further as:

$$\begin{aligned} V(G,D^\prime )= & {} -2log2+KL(P_{data}(x) \vert \vert A)+KL(P_z(Z) \vert \vert A) \quad \end{aligned}$$
(6)
$$\begin{aligned} A= & {} \frac{P_{data}(x)+P_z(Z)}{2} \quad \end{aligned}$$
(7)

Now, our next goal is to minimize \( P_z(Z)\log (1-D(Z))\). According to the method of gradient descent, using the optimal discriminator \(D^*\) we obtain the following update, where \( {P_z}^\prime \) is the solution for G:

$$\begin{aligned} {P_z}^\prime \leftarrow (P_z - \eta {\partial V(G, D^*))} \quad \end{aligned}$$
(8)

The Mahalanobis distance (MD) is defined as:

$$\begin{aligned} D_{MD}^2= & {} (x-m)^T C^{-1} (x-m) \quad \end{aligned}$$
(9)
$$\begin{aligned} C= & {} (x-m)(x-m)^T \quad \end{aligned}$$
(10)

Here x represents the whole set of pattern vectors and m one specific vector within it. In this paper, \(P_{data}(x)\) replaces x and \(P_z(Z)\) replaces m. The constrained GAN objective is:

$$\begin{aligned} f_{goal}=V(G^* ,D^*)+\lambda (P_{data}(x)-P_z(Z))^T C^{-1} (P_{data}(x)-P_z(Z)) \quad \end{aligned}$$
(11)

Formula (11) is introduced into the GAN as the new objective function, from which newly generated data can be obtained. A minimal loss sketch is given below.
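Below is a minimal PyTorch-style sketch of a generator loss in the spirit of Eq. (11), combining the adversarial term of Eq. (1) with a Mahalanobis penalty between real and generated batches. The use of batch means, the covariance regularizer and the value of \( \lambda \) are illustrative assumptions rather than the exact training recipe.

```python
import torch

# Sketch of a constrained generator loss in the spirit of Eq. (11).
# Batch-mean statistics, eps-regularized inversion and lambda_md are assumptions.

def mahalanobis_penalty(real, fake, eps=1e-3):
    """Squared Mahalanobis distance between batch means of real and generated data."""
    diff = real.mean(dim=0) - fake.mean(dim=0)                 # shape (d,)
    centered = real - real.mean(dim=0, keepdim=True)            # shape (n, d)
    cov = centered.T @ centered / (real.shape[0] - 1)           # sample covariance C
    cov = cov + eps * torch.eye(real.shape[1], device=real.device)
    return diff @ torch.linalg.solve(cov, diff)                 # (x-m)^T C^{-1} (x-m)

def generator_loss(discriminator, real, fake, lambda_md=0.1):
    """Adversarial term of Eq. (1) plus the Mahalanobis constraint, weighted by lambda."""
    adv = torch.log(1.0 - discriminator(fake) + 1e-8).mean()    # minimized by the generator
    return adv + lambda_md * mahalanobis_penalty(real, fake)
```

In practice, `real` and `fake` would be the flattened \( 20 \times 13 \) tiles (or the encoded rows) within each training batch.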

3.3 Integrated Classifiers

In the experiments, we found that the positive and negative sample distributions of some attributes are very similar; this is why we choose different methods and integrate them into a stronger classifier. According to a recent paper [33], if the distributions of the two classes are similar, it is hard to find a way to separate them cleanly. A practical approach is to integrate different classifiers, so that different features can be exploited effectively by different classifiers.

Firstly, we tested some common classification algorithms, for example the Decision Tree and Random Forest models. Owing to their simplicity and their flexibility in handling multiple data attribute types, decision trees [34, 35] are widely used in classification problems. A Random Forest [36] is a combination of multiple tree predictors in which each tree depends on an independently sampled random dataset and all trees in the forest follow the same distribution.

Then, in order to better understand how weak classifiers are integrated into strong classifiers, we further tested the performance of boosting methods such as AdaBoost, XGBoost and Gradient Boosting. AdaBoost is the original scheme for integrating weak classifiers into a strong classifier. Unlike AdaBoost, Gradient Boosting [37] chooses the direction of gradient descent in each iteration to ensure that the final result is the best, so it can achieve high accuracy. XGBoost [38] implements machine learning algorithms under the Gradient Boosting framework; it is an optimized distributed gradient boosting library used by data scientists to achieve state-of-the-art results on many machine learning challenges. Compared with Gradient Boosting, XGBoost first adds a regularization term; furthermore, its loss function is approximated with a second-order Taylor expansion, whereas Gradient Boosting uses a first-order expansion, so the loss approximation is more accurate.
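For reference, the second-order objective used by XGBoost can be summarized as follows (our paraphrase of the standard formulation in [38]; the notation is not taken from this paper):

$$\begin{aligned} \mathcal {L}^{(t)} \approx \sum _{i=1}^{n}\left[ l(y_i,\hat{y}_i^{(t-1)}) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i)\right] + \Omega (f_t) \end{aligned}$$

where \(g_i\) and \(h_i\) are the first- and second-order gradients of the loss l with respect to the previous prediction \(\hat{y}_i^{(t-1)}\), \(f_t\) is the tree added at step t, and \(\Omega \) is the regularization term; Gradient Boosting keeps only the first-order \(g_i\) term.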

Last, due to the outstanding performance of the Multi-Layer Perceptron (MLP) [39] and Naive Bayes in classification tasks, these two algorithms were also included in our candidate list. As the initial model of neural networks, the MLP has been used in a variety of applications, and the performance of some statistical models can be improved by neural networks. Naive Bayes [40] is a set of supervised learning algorithms based on Bayes' theorem under the assumption that each pair of features is independent. Although the assumption is simple, the Naive Bayes classifier works well on many real classification problems, such as document classification and spam filtering, and it needs only a small amount of training data to estimate the necessary parameters. Linear Regression (LR) [41, 42] and Stochastic Gradient Descent (SGD) [43] were also added to this framework; however, their performance was not satisfactory.

Table 2. Recall of different classifiers in predicting positive samples. The ratio in Table 2 represents the ratio of positive samples to negative samples.

In practice, we found an interesting phenomenon: cascading more classifiers does not always yield a better result. Because some classifiers have a negative impact on the results, we had to discard those with poor results, which is why we finally chose 4 of the 9 candidate classifiers.

After comprehensive measurement, we chose four final classifiers: Decision Tree, Gradient Boosting (GB), XGBoost and Naive Bayes. The voting rule is defined as:

$$\begin{aligned} \mathbb {P}(x)=\omega _1 * P_{Decisiontree}+\omega _2 * P_{Naive Bayes}+\omega _3 * P_{GB}+\omega _4 * P_{XGBoost} \end{aligned}$$
(12)

In Eq. (12), \( \omega \) denotes a weight learned in the final training, restricted to [0, 1], and P represents the probability predicted by the corresponding classifier.

One question needs explanation here: GB and XGBoost are very similar, so why keep both methods? In our experiments, the overall results after deleting GB were not better than those obtained by keeping it, so we chose to retain both. A sketch of the voting rule is given below.
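The following sketch implements the soft-voting rule of Eq. (12), assuming the four classifiers are already trained and expose scikit-learn-style `predict_proba`; the weight values and the 0.5 threshold are placeholders, not the ones learned in our experiments.

```python
import numpy as np

# Sketch of the weighted soft-voting rule in Eq. (12).
# Weights and the decision threshold are placeholders, not learned values.

def ensemble_proba(classifiers, weights, X):
    """Weighted sum of each classifier's positive-class probability."""
    probs = [clf.predict_proba(X)[:, 1] for clf in classifiers]   # P_i(x)
    return sum(w * p for w, p in zip(weights, probs))

def ensemble_predict(classifiers, weights, X, threshold=0.5):
    """Label a customer as positive when the normalized weighted score passes the threshold."""
    score = ensemble_proba(classifiers, weights, X) / sum(weights)
    return (score >= threshold).astype(int)
```

Here `classifiers` could be, for instance, a fitted `DecisionTreeClassifier`, `GaussianNB`, `GradientBoostingClassifier` and `XGBClassifier`, and `weights` four values in [0, 1] obtained during the final training.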

3.4 Algorithm Framework

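To show how the components of Sects. 3.1-3.3 fit together, the following end-to-end sketch runs the framework on invented data. The duplication-based enrichment is only a stand-in for the constrained GAN, XGBoost is omitted to keep the sketch dependency-free, and the equal voting weights are placeholders.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
X = rng.integers(0, 11, size=(1000, 13)).astype(float)    # stand-in for hybrid-encoded rows
y = (rng.random(1000) < 0.1).astype(int)                   # ~10% positive labels

# (1) Enrich positive samples: simple duplication here, standing in for the
#     constrained GAN of Sect. 3.2.
pos = X[y == 1]
X_aug = np.vstack([X, pos, pos])
y_aug = np.concatenate([y, np.ones(2 * len(pos), dtype=int)])

# (2) Train the selected classifiers on the enriched data.
models = [DecisionTreeClassifier(random_state=0), GaussianNB(),
          GradientBoostingClassifier(random_state=0)]
for m in models:
    m.fit(X_aug, y_aug)

# (3) Soft-vote with equal placeholder weights, cf. Eq. (12).
proba = np.mean([m.predict_proba(X)[:, 1] for m in models], axis=0)
print("predicted positives:", int((proba >= 0.5).sum()))
```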

4 Experiments

4.1 Data Analysis

This dataset contains some basic information about customers, including ages, occupations, education, and some macroeconomic indicators of the current period. There are 20 feature components in total and about 40,000 customer records, of which more than 4,000 are marked as positive samples. The statistical results show that the distributions of positive and negative samples are very similar for many feature components (Fig. 3); this observation naturally leads us to introduce a constrained GAN. We also found that some features have no correlation with the result, and these were discarded as noise in our subsequent processing.

Finally, we chose items 1, 2, 3, 4, 6, 7, 8, 9, 10, 12, 14, 15 and 16 from the original dataset as the final feature components.

Fig. 3. Part of the data distributions

4.2 The Result of Different Encoding Methods

In machine learning, recall and precision are contradictory indicators. In this problem, we focus on how many of the customers that institutions want to find are picked out correctly by our algorithm, so we use recall to evaluate the models.

$$\begin{aligned} Recall=\frac{TP}{TP+FN} \end{aligned}$$
(13)

where TP represents True Positives and FN False Negatives. Four training sets are built by dividing the positive and negative samples at ratios of 1:1 (4k:4k), 10:1 (10k:1k), 20:1 (20k:1k) and 30:1 (30k:1k); the \(X\)-axis represents the result of dividing the data at these proportions. The validation set consists of 200 positive samples and 800 negative samples. A sketch of this split construction is given below.
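The sketch below builds the four imbalanced training sets and the fixed validation set. The placeholder arrays, the assumption that the larger count in each ratio is the negative class, and sampling without replacement are illustrative choices, not details taken from the paper.

```python
import numpy as np

# Sketch of the experimental splits; placeholder data, and the larger count in
# each ratio is assumed to be the negative class.

def make_split(pos, neg, n_pos, n_neg, rng):
    """Draw n_pos positive and n_neg negative rows and stack them with labels."""
    X = np.vstack([pos[rng.choice(len(pos), n_pos, replace=False)],
                   neg[rng.choice(len(neg), n_neg, replace=False)]])
    y = np.concatenate([np.ones(n_pos, dtype=int), np.zeros(n_neg, dtype=int)])
    return X, y

rng = np.random.default_rng(0)
pos = rng.random((4500, 13))                       # placeholder positive rows
neg = rng.random((36000, 13))                      # placeholder negative rows
ratios = {"1:1": (4000, 4000), "10:1": (1000, 10000),
          "20:1": (1000, 20000), "30:1": (1000, 30000)}
train_sets = {k: make_split(pos, neg, p, n, rng) for k, (p, n) in ratios.items()}
X_val, y_val = make_split(pos, neg, 200, 800, rng)
```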

We evaluated the performance of different encodings: pure binary coding (all data encoded in one-hot mode), pure numerical coding (all data encoded as integers), and the hybrid encoding proposed by us. This experiment concentrates on the samples labeled 1.

Fig. 4. Processing data in image way

We can see that different coding methods have different effects on classification recognition, and the hybrid encoding method is clearly superior to the others. At the same time, as the proportion of negative samples to positive samples increases, the accuracy decreases rapidly.

4.3 Generating Positive Samples

(a) Firstly, the dataset is divided into two sets of positive and negative samples. Then, according to the method in Sect. 3, both the objective data and the index data are converted into integer values. All values are constrained to the range 0 to 255 and the data are stored as matrices; in this way, two matrix graphs are obtained.

(b) The two matrix graphs are cut into several pictures of size \( 20 \times 13 \).

(c) The pictures produced in (b) are sent to the constrained GAN to generate new images.

(d) All generated pictures are inversely transformed according to the process in step (a) to obtain new data (Fig. 5).

Fig. 5. Processing data in image way

We show the transformed data and a newly generated picture in Fig. 4. This paper uses a size of \( 20 \times 13 \) to display the data: 20 means that 20 users are used in one training pass, and 13 is the number of feature components; the last column represents the label.

4.4 The Constrained GAN

The previous experiments show that:

(a) Unbalanced samples have a negative impact on the final recognition; an order-of-magnitude difference between positive and negative samples can cause a rapid decline in accuracy. Integrated classifiers do improve classification performance.

(b) The encoding method influences the final classification results.

Next, in order to test the performance of enriching positive samples with the constrained GAN, we increased the number of positive samples from 4,000 to 10,000 and kept the ratio of positive to negative samples at 1:1 (10k:10k) (Tables 3 and 4).

Table 3. Result of enriching positive samples
Table 4. F1 Score of enriching positive samples

After enriching the positive samples, the recall and F1 score both improved, which indicates that our method is effective. However, we want to emphasize that the number of positive samples should not be increased indefinitely (the proportion of positive and negative samples should stay consistent with the real situation; in this article we also keep the positive-to-negative ratio at 1:4). Enriching positive samples can raise the recall when the proportion of positive and negative samples is extremely unbalanced, but if the number of positive samples is increased without restriction, it will only lead to inadequate learning of the negative samples; even if the accuracy on positive samples improves, it becomes a meaningless numbers game.

5 Conclusion

In this article, we proposed a hybrid encoding and a sample enrichment method, and experiments show that the proposed methods achieve an obvious performance improvement. First, data with different attributes are transformed into different codes; then positive samples are enriched using a constrained GAN, which avoids producing false data when the distributions of positive and negative samples are similar. Finally, we select some stable, reliable classifiers with good performance through experiments and use them to integrate a soft classifier. Our method works well on the dataset discussed in this paper.