1 Introduction

Software faults are an inescapable problem in software projects. They may be caused by internal defects of the software or by external attacks. Many cases show that software faults can cause huge losses and catastrophic consequences. For example, in 1962 a software fault caused the failure of the Mariner rocket launched toward Venus. In 2003, the blackout in the Northeastern United States was also caused by a software fault. In 2009, an attack on video software caused an extensive fault that left people in six provinces of China unable to access the Internet.

Software faults are closely related to the security, reliability and maintainability of a system [1]. For high-risk systems in particular, software faults can lead to serious consequences. In a software project, it is quite difficult for testers to find all the faults. Researchers therefore focus on software fault prediction, which helps testers estimate the number and distribution of faults. Researchers have also studied software metrics, which represent attributes of the software. These attributes can be used as features to predict software faults. The classical metrics include LOC count, McCabe [2, 3] and Halstead [2, 4].

Machine learning is widely used in software fault prediction. In [5], 22 machine learning classifiers were used for software fault prediction. L. Kumar built a Least Squares Support Vector Machine (LSSVM) model for software fault prediction [6]. D. R. Ibrahim used random forest with improved features for software fault prediction [7]. Rathore proposed a decision tree approach for software fault prediction [8]. Logistic regression was compared with decision tree to improve the result of software fault prediction [9].

Although these machine learning techniques have been applied to software fault prediction, an important problem is often ignored: imbalanced data [10]. In a software project, the amount of non-fault data (majority) is usually well above the amount of fault data (minority). As a result, fault instances (minority) tend to be predicted as non-fault instances (majority). There are usually two ways to deal with this problem: under-sampling [11] and over-sampling [12]. Under-sampling randomly reduces the number of majority samples to balance the two classes, but it discards useful information. Over-sampling is therefore usually adopted. SMOTE (Synthetic Minority Over-sampling Technique) [13] is a well-known over-sampling technique for imbalanced data.

Can deep learning techniques be used for software fault prediction? In recent years, deep learning has been widely used in many fields, such as image recognition, natural language processing [14] and voice recognition, with great success. So far, however, it has seldom been applied to software fault prediction.

Most applications of the Variational Autoencoder (VAE) are in image processing [15, 16], and some are in text generation [17]. Similarly, Generative Adversarial Networks (GAN) are mostly used for images [18, 19]. VAE is a generative model [20]. GAN combines a generative model and a discriminative model [21]. Both VAE and GAN can generate new synthetic samples that follow the distribution of the real data.

Little research applies deep learning techniques to software fault prediction. We are motivated to use the sample-generating ability of VAE and GAN to generate new fault samples, which can then be used to balance the classes. In our previous work [22], we adopted VAE for software fault prediction and compared its performance with a no-sampling method. In this paper, both VAE and GAN are used and compared for software fault prediction. GAN is known to generate better image samples than VAE [23]. We therefore adopt both VAE and GAN to deal with imbalanced data and investigate which one performs better in software fault prediction. SMOTE is also used so that its performance can be compared with that of VAE and GAN.

In this paper, we find that the deep learning techniques VAE and GAN are useful for software fault prediction. VAE performs better than GAN and SMOTE, and GAN outperforms SMOTE on some datasets.

The main contributions in this paper are as follows:

  • Software fault data are multivariable data, which differ from image data. The VAE and GAN models are designed to fit this type of data. The deep architectures of VAE and GAN are implemented with the Keras framework on a TITAN X GPU.

  • To the best of our knowledge, this is the first study of both VAE and GAN for software fault prediction, and the performance of VAE, GAN and SMOTE is compared. The experimental results not only demonstrate that VAE and GAN are useful for software fault prediction, but also show that VAE outperforms GAN and SMOTE. This suggests that VAE is better than GAN at generating multivariable software fault data, although GAN performs better than VAE at generating images.

The rest of the paper is structured as follows: Sect. 2 describes the background knowledge; Sect. 3 presents the experimental methodology; Sect. 4 gives the experimental results; Sect. 5 draws a conclusion.

2 Background

2.1 Variational Autoencoder (VAE)

In 2014, Kingma proposed VAE, which combines statistical learning and deep learning techniques [20]. VAE generates new samples from a latent variable Z. Assume that Z follows a Gaussian distribution p(Z). By randomly sampling Z from p(Z), new samples can be created. Within the VAE model, the posterior p(Z|X) is assumed to be a normal distribution. Suppose the input is Xk; its latent variable then follows p(Z|Xk). A generator, G = g(Z), is trained, and Gk is generated by sampling Z from p(Z|Xk).

The mathematical theory of VAE is complicated, but its implementation is not hard to understand from an engineering point of view [24]. The implementation of VAE is shown in Fig. 1. VAE consists of an encoder and a decoder (generator). The input data enter the encoder, which outputs the mean μ and the logarithmic variance of the latent variable. The latent variable is then standardized so that it follows the standard normal distribution, as expressed in formula (1). Conversely, by sampling ε from N(0, I), Z is obtained by formula (2). The VAE model is trained to minimize the KL divergence loss, and the network can be trained by Stochastic Gradient Descent (SGD). Gk is the data generated by the decoder (generator).

Fig. 1. The realization diagram of VAE

$$ \varepsilon = (Z - \mu )/\sigma $$
(1)
$$ Z = \mu + \varepsilon \times \sigma $$
(2)
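
Formulas (1) and (2) and the KL term can be sketched in NumPy as follows. This is a minimal illustration, not the Keras implementation used in the experiment: the function names `sample_latent` and `kl_to_standard_normal` are hypothetical, the encoder outputs are made-up values, and the closed-form KL expression assumes a diagonal Gaussian posterior compared against N(0, I).

```python
import numpy as np

def sample_latent(mu, log_var, rng):
    """Formula (2): Z = mu + eps * sigma, with eps drawn from N(0, I)."""
    sigma = np.exp(0.5 * log_var)          # log-variance -> standard deviation
    eps = rng.standard_normal(mu.shape)
    return mu + eps * sigma, eps

def kl_to_standard_normal(mu, log_var):
    """KL(N(mu, sigma^2) || N(0, I)) per sample, summed over latent dimensions."""
    return -0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var), axis=-1)

rng = np.random.default_rng(0)
mu = np.array([[0.5, -1.0]])        # illustrative encoder outputs (latent dimension 2)
log_var = np.array([[0.0, -2.0]])

z, eps = sample_latent(mu, log_var, rng)
# Formula (1) recovers eps from Z
eps_back = (z - mu) / np.exp(0.5 * log_var)
kl = kl_to_standard_normal(mu, log_var)
```

Writing Z as a deterministic function of μ, σ and an independent noise term ε is what allows gradients to flow through the sampling step during training.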

2.2 Generative Adversarial Networks (GAN)

Generative Adversarial Networks (GAN) were proposed in 2014 by Goodfellow [21] and have been a hot topic in recent years. GAN is widely used in image translation, super-resolution, semantic segmentation, etc. The basic structure of GAN is illustrated in Fig. 2. GAN contains two parts: the generator G and the discriminator D. The generator takes random noise as input and learns the distribution of real samples, so that it can produce fake samples that imitate real ones. Both real samples and fake samples are fed into the discriminator, which tries to determine whether its input is real or fake. In short, the generator can be seen as a team of counterfeiters who try to produce fake currency and use it without being detected, and the discriminator can be seen as the police who try to detect the fake currency. The generator tries to deceive the discriminator, and the discriminator tries to see through the fraud.

Fig. 2. Basic structure of GAN: the noise in latent space is the input of G, which generates fake samples. The real samples and fake samples are fed into D respectively, and D determines whether each sample is real or fake. The decision of D is compared with the ground truth, and the result of the comparison is sent back to G and D, which then adjust their network parameters by training.

In the paper of Goodfellow, the generator and the discriminator are composed of multilayer perceptrons. The objective function can be seen in Eq. (3).

$$ \mathop {\min }\limits_{G} \mathop {\max }\limits_{D} V(D,G) = E_{x \sim p_{data} (x)} [\log D(x)] + E_{z \sim p_{z} (z)} [\log (1 - D(G(z)))] $$
(3)

The output of D is a single scalar: \( D(x) \) is the probability that \( x \) comes from the real samples rather than from the generator. D is trained to maximize the probability of assigning the correct label to both real and fake samples. G is trained simultaneously to minimize \( \log (1 - D(G(z))) \). The generator and the discriminator thus compete with each other in a min-max game, and training ideally ends when they reach a Nash equilibrium.
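
To make the objective in Eq. (3) concrete, the sketch below estimates V(D, G) from discriminator outputs. It is an illustration only: `value_function` is a hypothetical helper and the probabilities are made-up values rather than outputs of a trained network.

```python
import numpy as np

def value_function(d_real, d_fake):
    """Monte Carlo estimate of V(D, G) in Eq. (3).

    d_real: D(x) for real samples; d_fake: D(G(z)) for generated samples.
    """
    d_real = np.asarray(d_real, dtype=float)
    d_fake = np.asarray(d_fake, dtype=float)
    return np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake))

# A discriminator that cannot tell real from fake outputs 0.5 everywhere
v_equilibrium = value_function([0.5, 0.5], [0.5, 0.5])    # equals -log(4)
# A confident, correct discriminator pushes V towards 0, its upper bound
v_strong_d = value_function([0.99, 0.98], [0.01, 0.02])
```

When the discriminator outputs 0.5 for every input, which is the case when generated samples are indistinguishable from real ones, the value is -log 4. D tries to raise V above this level, while G tries to push it back down.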

2.3 Synthetic Minority Over-Sampling Technique (SMOTE)

SMOTE (Synthetic Minority Over-sampling Technique) was presented in 2002 by N. V. Chawla [13]. It improves on random over-sampling, which increases the number of samples simply by copying original samples and therefore often generalizes poorly. SMOTE improves the generalization ability by analyzing the minority class and generating synthetic samples for it. The core of the technique is interpolation. SMOTE is implemented as follows:

  • Given a minority sample \( x_{i} \), \( i \in \left\{ {1, \ldots ,T} \right\} \), where T is the number of minority samples, compute the Euclidean distance from \( x_{i} \) to every other minority sample and obtain its k nearest neighbors \( x_{i(near)} \), \( near \in \left\{ {1, \ldots ,k} \right\} \).

  • Choose a sample \( x_{i(nn)} \) at random from the k neighbors and synthesize a new sample by the following formula.

    $$ x_{i1} = x_{i} + \zeta \cdot (x_{i(nn)} - x_{i} ),\quad \zeta \in (0,1) $$
    (4)

  • Repeat the previous step N times to generate N new samples \( x_{inew} \), \( new \in \left\{ {1, \ldots ,N} \right\} \), from sample \( x_{i} \).
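
The three steps can be sketched in NumPy as follows. This is a simplified illustration of the interpolation idea, not a replacement for a library implementation; the function name `smote` and the toy minority set are assumptions made for the example.

```python
import numpy as np

def smote(minority, k, n_new, rng):
    """Generate n_new synthetic samples per minority sample by interpolation."""
    minority = np.asarray(minority, dtype=float)
    synthetic = []
    for i, x_i in enumerate(minority):
        # Step 1: Euclidean distances to the other minority samples, k nearest neighbors
        dist = np.linalg.norm(minority - x_i, axis=1)
        dist[i] = np.inf                        # exclude the sample itself
        neighbors = np.argsort(dist)[:k]
        for _ in range(n_new):                  # Step 3: repeat N times
            # Step 2: pick one neighbor at random and interpolate, formula (4)
            x_nn = minority[rng.choice(neighbors)]
            zeta = rng.uniform(0.0, 1.0)
            synthetic.append(x_i + zeta * (x_nn - x_i))
    return np.array(synthetic)

rng = np.random.default_rng(0)
faults = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # toy minority set
new_faults = smote(faults, k=2, n_new=3, rng=rng)
```

Because every synthetic sample lies on the line segment between a minority sample and one of its neighbors, the new points stay inside the region already covered by the minority class.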

3 Experimental Methodology

This section describes the experimental methodology. As shown in the flow chart of Fig. 3, the main idea of the experiment is to balance the software fault data with VAE, GAN and SMOTE, each of which generates synthetic fault samples to increase the number of minority samples. The new fault samples generated by each method are added to the original software fault data separately, so that the number of fault samples approaches the number of non-fault samples. The flow chart is explained in further detail below.

Fig. 3. The flow chart of the experiment
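
The balancing step can be sketched as below. The helper `balance_with_generator` and its `generate` callback are hypothetical names, the data are random placeholders, and the sketch assumes the classes are made exactly equal in size, whereas the text only states that the number of fault samples approaches the number of non-fault samples.

```python
import numpy as np

def balance_with_generator(x, y, generate, fault_label=1):
    """Append synthetic fault samples until both classes have the same size.

    generate(n) is any callable returning n synthetic fault samples,
    e.g. a trained VAE decoder, a GAN generator, or SMOTE.
    """
    n_fault = int(np.sum(y == fault_label))
    n_normal = len(y) - n_fault
    n_needed = max(n_normal - n_fault, 0)
    if n_needed == 0:
        return x, y
    x_new = generate(n_needed)
    y_new = np.full(n_needed, fault_label)
    return np.vstack([x, x_new]), np.concatenate([y, y_new])

# Placeholder data: 8 non-fault and 2 fault instances with 21 attributes
rng = np.random.default_rng(0)
x = rng.uniform(size=(10, 21))
y = np.array([0] * 8 + [1] * 2)

# Stand-in generator that draws uniform noise; a real one would be a trained model
x_bal, y_bal = balance_with_generator(x, y, lambda n: rng.uniform(size=(n, 21)))
```

Passing the generator as a callable keeps the balancing logic identical for the three methods, so only the source of the synthetic samples changes between runs.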

First, the data are processed: instances with missing data are deleted and the features are normalized. Then the VAE and GAN models are trained to generate new fault samples, and SMOTE is also used to generate synthetic fault samples. The samples generated by VAE, GAN and SMOTE are added to the original fault data (minority) separately. These methods are labeled “VAE”, “GAN” and “SMOTE” in the flow chart of Fig. 3. Four classifiers, namely RF (Random Forest), SVM (Support Vector Machine), LR (Logistic Regression) and DT (Decision Tree), are adopted to obtain the results of software fault prediction. AUC, MCC, recall and F1-measure are selected to evaluate the classifiers. The performance of VAE, GAN and SMOTE is compared, and the results of VAE and GAN are also compared with the results of paper [5]. The experiment is run on a TITAN X GPU. The runtime environment is as follows:

  • Python 3.6

  • Keras 2.1.6

  • TensorFlow 1.4.1

3.1 Data Processing

Data processing is the premise of the experiment. After instances with missing data are deleted, normalization is carried out. Normalization maps the values of different dimensions to the same range. Min-Max scaling and Z-score are classical normalization techniques, and Min-Max scaling is commonly adopted for neural networks. In this paper, VAE and GAN are built as MLPs, so we choose Min-Max scaling. The transformation is given by formula (5).

$$ Z = \frac{{x_{i} - Min(x_{i} )}}{{Max(x_{i} ) - Min(x_{i} )}} $$
(5)

There are several ways to deal with missing data, for example replacing a missing value with 0, 1 or the mean of the feature. Here, in order to reduce uncertainty, instances with missing values are deleted. This is done with Pandas.
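
A minimal sketch of these two preprocessing steps with Pandas is given below. The function `preprocess`, the column names and the toy values are hypothetical; the handling of constant columns is an added safeguard that the text does not discuss.

```python
import numpy as np
import pandas as pd

def preprocess(df, feature_cols):
    """Drop instances with missing values, then Min-Max scale each feature to [0, 1]."""
    clean = df.dropna(subset=feature_cols).copy()
    x = clean[feature_cols].astype(float)
    span = x.max() - x.min()
    span = span.replace(0.0, 1.0)       # constant column: avoid division by zero
    clean[feature_cols] = (x - x.min()) / span   # formula (5), applied per feature
    return clean

raw = pd.DataFrame({
    "loc": [10.0, 50.0, np.nan, 30.0],   # hypothetical metric columns
    "v(g)": [1.0, 5.0, 3.0, 3.0],
    "defects": [0, 1, 0, 0],
})
scaled = preprocess(raw, ["loc", "v(g)"])
```

The minimum and maximum are computed after the incomplete instances are removed, so deleted rows do not influence the scaling range.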

3.2 The Design of VAE

As shown in Table 1, the architecture of VAE is an MLP (multilayer perceptron). The input dimension of the encoder is the number of code attributes of the software fault data, which is 21 in this experiment. The number of neurons in the hidden layer is set to 100; during training, we found that the loss decreased rapidly with this setting. The output dimension of the encoder is 2, and the outputs are the mean and logarithmic variance of the latent variable. In the decoder (generator), the first dense layer also has 100 neurons, and the output is a simulated fault instance of dimension 21. RMSProp is selected as the optimizer for the VAE model. The loss function of the model is KL divergence.

Table 1. The architecture of VAE
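
The layer sizes stated in this section (21 → 100 → 2 for the encoder and 2 → 100 → 21 for the decoder) can be checked with an untrained NumPy forward pass. This is a shape sketch only, not the Keras model used in the experiment: the weights are random, and the activation functions (ReLU in the hidden layers, sigmoid at the output) are assumptions, since they are not specified in the text.

```python
import numpy as np

INPUT_DIM, HIDDEN_DIM, LATENT_DIM = 21, 100, 2   # sizes stated in the text

rng = np.random.default_rng(0)

def dense(in_dim, out_dim):
    """Random, untrained weights and zero biases for one fully connected layer."""
    return rng.normal(0.0, 0.1, (in_dim, out_dim)), np.zeros(out_dim)

def relu(a):
    return np.maximum(a, 0.0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

enc_h = dense(INPUT_DIM, HIDDEN_DIM)
enc_mu = dense(HIDDEN_DIM, LATENT_DIM)
enc_logvar = dense(HIDDEN_DIM, LATENT_DIM)
dec_h = dense(LATENT_DIM, HIDDEN_DIM)
dec_out = dense(HIDDEN_DIM, INPUT_DIM)

def encode(x):
    h = relu(x @ enc_h[0] + enc_h[1])
    return h @ enc_mu[0] + enc_mu[1], h @ enc_logvar[0] + enc_logvar[1]

def decode(z):
    h = relu(z @ dec_h[0] + dec_h[1])
    # Sigmoid keeps outputs in (0, 1), matching Min-Max scaled inputs
    return sigmoid(h @ dec_out[0] + dec_out[1])

x = rng.uniform(0.0, 1.0, (4, INPUT_DIM))        # four placeholder instances
mu, log_var = encode(x)
z = mu + rng.standard_normal(mu.shape) * np.exp(0.5 * log_var)
generated = decode(z)
```

After training, new fault instances would be produced by calling the decoder on latent vectors, which is the role played by `decode` in this sketch.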

3.3 The Design of GAN

The most difficult problem in the experiment is training the GAN model. MLPs are used to build the generator and the discriminator. After several failed attempts, a model that trains successfully on software fault data was obtained. The architecture of GAN is given in Table 2. The loss curves of the generator and the discriminator, shown in Fig. 4, are as expected. Adam is selected as the optimizer for the GAN model, and the Leaky ReLU slope is set to 0.2.

Table 2. The architecture of GAN

Fig. 4. The loss curves of generator and discriminator

3.4 Evaluation

In the experiment, AUC, MCC, recall and F1-measure are selected to evaluate the results. AUC is the area under the ROC (receiver operating characteristic) curve. Recall, F1-measure and MCC are defined in formulas (6), (7) and (8).

$$ recall = \frac{TP}{TP + FN} $$
(6)
$$ F1 = \frac{2 \times (recall \times precision)}{recall + precision} $$
(7)
$$ MCC = \frac{(TP \times TN) - (FP \times FN)}{{\sqrt {(TP + FP)(TP + FN)(TN + FP)(TN + FN)} }} $$
(8)
  • MCC: Matthews correlation coefficient;

  • F1-measure: the harmonic mean of recall and precision, where precision = TP/(TP + FP);

  • TP: True Positives; FP: False Positives;

  • TN: True Negatives; FN: False Negatives.
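
As a worked example, the measures can be computed directly from confusion-matrix counts. The helper `fault_metrics` and the counts are illustrative and are not taken from the experiment; the MCC denominator uses the standard product form sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)), and returning 0 when that denominator is zero is a convention assumed here.

```python
import math

def fault_metrics(tp, fp, tn, fn):
    """Recall, precision, F1 and MCC from confusion-matrix counts."""
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * recall * precision / (recall + precision)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"recall": recall, "precision": precision, "f1": f1, "mcc": mcc}

# Illustrative counts, not taken from the experiment
m = fault_metrics(tp=40, fp=10, tn=45, fn=5)
perfect = fault_metrics(tp=10, fp=0, tn=10, fn=0)
```

Unlike recall and F1, MCC uses all four counts, so it also reflects how well the non-fault class is predicted.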

4 The Results of Experiment

In the experiment, three datasets, JM1, KC1 and KC2, are selected from the PROMISE Repository. The datasets are public and can be downloaded from the Internet [25]. They record software faults in NASA spaceflight projects. Each dataset has 21 code attributes, which include the McCabe, Halstead, LOC and miscellaneous metrics listed in Table 3.

Table 3. The metrics in JM1, KC1, and KC2

The numbers of normal and anomaly instances used by VAE, GAN and SMOTE are given in Table 4. Anomaly instances are fault data, which belong to the minority class. We focus on the fault prediction performance of the different methods and compare AUC, MCC, recall and F1-measure. For each of the three datasets, 90% of the data are used as the training set and 10% as the test set. The results of AUC and MCC are shown in Table 5.

Table 4. Data in the experiment
Table 5. AUC and MCC of VAE, GAN and SMOTE

For JM1, VAE has the best AUC and MCC among the three methods, obtained with the Random Forest classifier: 0.92 and 0.78 respectively. The best AUC and MCC of GAN are 0.92 and 0.77, also with Random Forest, which are higher than the best AUC and MCC of SMOTE, 0.89 and 0.73 respectively.

For KC1, VAE has the best AUC and MCC among the three methods, obtained with the Random Forest classifier: 0.94 and 0.81 respectively. The best AUC and MCC of GAN are 0.87 and 0.65, obtained with SVM, which are lower than the best AUC and MCC of SMOTE, 0.88 and 0.69 respectively.

For KC2, VAE has the best AUC and MCC among the three methods, obtained with the Logistic Regression classifier: 0.94 and 0.83 respectively. The best AUC and MCC of GAN are 0.93 and 0.78, obtained with Random Forest, which are higher than the best AUC and MCC of SMOTE, 0.92 and 0.76 respectively.

The best average recall of VAE, GAN and SMOTE on the three datasets is compared in Fig. 5. For JM1, the recall of VAE is 0.89, the highest of the three methods; the recall of GAN is 0.88, higher than the 0.86 of SMOTE. For KC1, the recall of VAE is 0.90, again the highest; the recall of GAN is 0.83, lower than the 0.85 of SMOTE. For KC2, the recall of VAE is 0.92, the highest; the recall of GAN is 0.89, higher than the 0.88 of SMOTE.

Fig. 5. The best average recall on three datasets by VAE, GAN and SMOTE

The best average F1-measure of VAE, GAN and SMOTE on the three datasets is compared in Fig. 6. The pattern is the same as for recall: VAE has the highest F1-measure of the three methods, and GAN has a better F1-measure than SMOTE on JM1 and KC2.

Fig. 6. The best average F1-measure on three datasets by VAE, GAN and SMOTE

AUC results for two of the datasets, JM1 and KC1, are reported in paper [5]. We compare the best AUC in paper [5] with the best AUC of VAE and GAN in this experiment; the comparison is shown in Table 6. VAE and GAN both achieve a higher AUC than the best results in paper [5].

Table 6. Comparison of AUC

From the comparison, we find that VAE performs better than GAN and SMOTE on all three datasets, JM1, KC1 and KC2, and that GAN performs better than SMOTE on JM1 and KC2. The best AUC of VAE and GAN exceeds the best AUC reported in paper [5]. VAE and GAN are therefore useful methods for software fault prediction. GAN usually generates better image samples than VAE, but in this experiment VAE outperforms GAN at generating software fault samples.

5 Conclusion

In this paper, we apply the deep learning techniques VAE and GAN to software fault prediction and compare their performance. The architectures of VAE and GAN are designed to fit the multivariable data of software faults, and their ability to generate new fault samples is used to balance the normal and anomaly classes. An experiment is carried out to verify the proposed scheme on the datasets JM1, KC1 and KC2, which come from NASA’s spaceflight software projects, using four classifiers. We find that VAE performs better than GAN and SMOTE for software fault prediction. Although GAN usually generates better images than VAE, it does not outperform VAE at generating software fault data in this experiment. GAN performs better than SMOTE on JM1 and KC2. Compared with the results in paper [5], VAE and GAN achieve a better AUC. We conclude that it is practicable to apply VAE and GAN to software fault prediction, and that VAE performs better than GAN.