Abstract
Software faults are an unavoidable problem in software projects, and predicting them to improve the safety and reliability of a system is worth studying. In recent years, deep learning has been widely used in image, text and voice processing, but it has seldom been applied to software fault prediction. Considering its capabilities, we select two deep learning techniques, VAE and GAN, for software fault prediction and compare their performance. A salient feature of software fault data is that the proportion of non-fault data is well above the proportion of fault data. Because of this imbalance, it is difficult to predict software faults with high accuracy. VAE and GAN are known to be able to generate synthetic samples that follow the distribution of the real data. We take advantage of this ability to generate new fault samples in order to improve the accuracy of software fault prediction. The architectures of VAE and GAN are designed to fit the high-dimensional software fault data. New fault samples are generated to balance the software fault datasets and thereby improve prediction performance. The VAE and GAN models are trained on a TITAN X GPU. SMOTE is also adopted as a baseline for comparison with VAE and GAN. The experimental results show that VAE and GAN are useful techniques for software fault prediction and that VAE performs better than GAN on this task.
1 Introduction
Software faults are an inescapable problem in software projects. A fault may be caused by internal defects of the software or by external attacks. Many cases show that software faults can cause huge losses and catastrophic consequences. For example, in 1962 a well-known software fault caused the failure of the Mariner rocket bound for Venus. In 2003, the blackout in the Northeastern United States was also attributed to a software fault. In 2009, attackers launched an attack on a video software service, which caused an extensive software fault that left people in 6 provinces of China unable to access the Internet.
Software faults are closely related to the security, reliability and maintainability of a system [1]. For high-risk systems in particular, software faults can lead to serious consequences. In a software project, it is quite difficult for testers to find all the faults. Researchers therefore focus on software fault prediction, which helps testers estimate the number and distribution of faults reasonably. Researchers have also studied metrics that represent attributes of software. These attributes can be used as features for predicting software faults. The classical metrics include LOC counts, McCabe [2, 3] and Halstead [2, 4].
Machine learning is widely used in software fault prediction. In [5], 22 machine learning classifiers were used for software fault prediction. L. Kumar built a Least Squares Support Vector Machine (LSSVM) model for software fault prediction [6]. D. R. Ibrahim used a random forest based on improved features for software fault prediction [7]. Rathore proposed a decision tree approach for software fault prediction [8]. Logistic regression was compared with decision trees to improve the results of software fault prediction [9].
Although these machine learning techniques have been applied to software fault prediction, an important problem is often ignored: imbalanced data [10]. In a software project, the amount of non-fault data (the majority class) is usually well above the amount of fault data (the minority class). As a result, fault data tend to be misclassified as non-fault data. There are usually two ways to deal with this problem: under-sampling [11] and over-sampling [12]. Under-sampling randomly reduces the number of majority samples to balance the two classes, but it loses useful information. Over-sampling is therefore usually adopted. SMOTE (Synthetic Minority Over-sampling Technique) [13] is a well-known over-sampling technique that is very useful for handling imbalanced data.
Can deep learning techniques be used for software fault prediction? In recent years, deep learning has been widely used in many fields, such as image recognition, natural language processing [14] and voice recognition, and it has achieved resounding success. So far, however, it has seldom been applied to software fault prediction.
Most applications of the Variational Autoencoder (VAE) are in image processing [15, 16], and some are in text generation [17]. Similarly, GAN (Generative Adversarial Networks) is mostly used for images [18, 19]. VAE is a generative model [20], while GAN combines a generative model with a discriminative model [21]. Both VAE and GAN can generate new synthetic samples that follow the distribution of the real data.
Few studies apply deep learning techniques to software fault prediction. Our idea is to use the sample-generating ability of VAE and GAN to create new fault samples, which can then be used to balance the classes. In our previous work [22], we adopted VAE for software fault prediction and compared its performance with a no-sampling method. In this paper, both VAE and GAN are used and compared. GAN is known to generate better image samples than VAE [23]. This motivates us to apply both VAE and GAN to the imbalanced-data problem and to find out which one is better for software fault prediction. SMOTE is also used as a baseline for comparison with VAE and GAN.
In this paper, we find that deep learning techniques of VAE and GAN are useful in the field of software fault prediction. VAE has better performance than GAN and SMOTE. GAN outperforms SMOTE on some datasets.
The main contributions in this paper are as follows:
- Software fault data are multivariable data, which differ from image data. The VAE and GAN models are designed to fit this type of data. The deep architectures of VAE and GAN are implemented in Keras and run on a TITAN X GPU.
- To the best of our knowledge, this is the first study of both VAE and GAN for software fault prediction, and the performance of VAE, GAN and SMOTE is compared. The experimental results not only demonstrate that VAE and GAN are useful for software fault prediction, but also show that VAE outperforms GAN and SMOTE. It can be inferred that VAE is better than GAN at generating multivariable software fault data, although GAN performs better than VAE at generating images.
The rest of the paper is structured as follows: Sect. 2 describes the background; Sect. 3 presents the experimental methodology; Sect. 4 gives the experimental results; Sect. 5 draws a conclusion.
2 Background
2.1 Variational Autoencoder (VAE)
In 2014, Kingma proposed the theory of VAE, which combines statistical learning with deep learning techniques [20]. VAE generates new samples from a latent variable Z. Z is assumed to follow a Gaussian distribution p(Z), and the posterior p(Z|X) is also assumed to be normal. For an input Xk, the encoder gives a posterior p(Z|Xk) that is specific to that input. A generator, G = g(Z), is trained, and the generated sample Gk is obtained by sampling Z from p(Z|Xk) and passing it through the generator.
The mathematical theory of VAE is complicated, but its engineering implementation is not hard to understand [24]. The implementation of VAE is shown in Fig. 1. A VAE consists of an encoder and a decoder (generator). The input data enter the encoder, which outputs the mean and logarithmic variance of the latent variable. After that, the encoder outputs are transformed to follow a standard normal distribution, which is implemented by formulas (1) and (2). Z is acquired by sampling ε from N(0, I). The VAE model is trained to minimize the KL divergence loss, and the network can be trained by Stochastic Gradient Descent (SGD). Gk is the data generated by the decoder (generator).
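The sampling step can be sketched in a few lines of NumPy. This is a minimal sketch, assuming that the latent variable is drawn by the usual reparameterization \( Z = \mu + \sigma \cdot \varepsilon \) with \( \varepsilon \sim N(0, I) \) and \( \sigma = \exp (\log \sigma^{2} /2) \), and that the KL term is the closed-form divergence between \( N(\mu ,\sigma^{2} ) \) and \( N(0, I) \). The function names `sample_latent` and `kl_divergence`, the batch size and the latent dimension are illustrative and are not taken from the paper.

```python
import numpy as np


def sample_latent(mean, log_var, rng):
    """Reparameterization: z = mean + sigma * eps, with eps ~ N(0, I)."""
    eps = rng.standard_normal(mean.shape)
    sigma = np.exp(0.5 * log_var)  # log-variance -> standard deviation
    return mean + sigma * eps


def kl_divergence(mean, log_var):
    """KL( N(mean, sigma^2) || N(0, I) ), summed over the latent dimensions."""
    return -0.5 * np.sum(1.0 + log_var - mean ** 2 - np.exp(log_var), axis=-1)


rng = np.random.default_rng(0)
mean = np.zeros((4, 2))     # assumed: batch of 4 inputs, latent dimension 2
log_var = np.zeros((4, 2))  # log(1) = 0, i.e. unit variance
z = sample_latent(mean, log_var, rng)
kl = kl_divergence(mean, log_var)  # zero when the posterior equals N(0, I)
```

Writing the sample as a deterministic function of the encoder outputs plus external noise is what allows gradients to flow through the sampling step during SGD training.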
2.2 Generative Adversarial Networks (GAN)
Generative Adversarial Networks (GAN) were proposed in 2014 by Goodfellow [21] and have been a hot topic in recent years. GAN is widely used in image translation, super-resolution, semantic segmentation and other fields. The basic structure of GAN is illustrated in Fig. 2. GAN contains two parts: the generator G and the discriminator D. The generator learns the distribution of the real samples. It takes random noise as input and produces fake samples that imitate the real ones. Both real and fake samples are fed into the discriminator, which tries to determine whether its input is real or fake. In short, the generator can be seen as a team of counterfeiters who try to make fake currency and use it without being caught, and the discriminator can be seen as the police who try to detect the fake currency. The generator tries to fool the discriminator, and the discriminator tries to see through the fraud.
In Goodfellow's paper, the generator and the discriminator are both multilayer perceptrons. The objective function is given in Eq. (3).
The output of D is a single scalar: \( D(x) \) is the probability that \( x \) comes from the real samples. D is trained to maximize the probability of assigning the correct label to both real and fake samples. G is trained simultaneously to minimize \( \log (1 - D(G(z))) \). The generator and the discriminator thus compete with each other in a minimax game, and ideally they reach a Nash equilibrium.
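The objective in [21] is the minimax value function \( V(D,G) = E_{x} [\log D(x)] + E_{z} [\log (1 - D(G(z)))] \), which D maximizes and G minimizes. The sketch below estimates this quantity from discriminator outputs, assuming that standard form. The function name `gan_value` and the sample probabilities are illustrative only.

```python
import numpy as np


def gan_value(d_real, d_fake, eps=1e-12):
    """Monte Carlo estimate of V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))].

    d_real: discriminator outputs D(x) on real samples, in (0, 1)
    d_fake: discriminator outputs D(G(z)) on generated samples, in (0, 1)
    """
    d_real = np.clip(d_real, eps, 1.0 - eps)  # avoid log(0)
    d_fake = np.clip(d_fake, eps, 1.0 - eps)
    return np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake))


# A discriminator that separates real from fake well pushes V towards 0.
v_confident = gan_value(np.array([0.99, 0.98]), np.array([0.01, 0.02]))

# At equilibrium D outputs 0.5 everywhere, so V = -log(4).
v_equilibrium = gan_value(np.full(8, 0.5), np.full(8, 0.5))
```

The equilibrium value \( - \log 4 \) corresponds to a discriminator that can no longer tell generated samples from real ones.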
2.3 Synthetic Minority Over-Sampling Technique (SMOTE)
SMOTE (Synthetic Minority Over-sampling Technique) was presented in 2002 by N. V. Chawla [13]. It is an improvement on random over-sampling. Random over-sampling increases the number of samples simply by copying original samples, which often leads to poor generalization. SMOTE improves generalization by analyzing the minority class and generating synthetic samples for it. The core of the technique is interpolation. SMOTE is implemented as follows:
- Given a sample \( x_{i} \) in the minority class, \( i \in \left\{ {1, \ldots ,T} \right\} \), where T is the number of minority samples, compute the Euclidean distance from \( x_{i} \) to every other minority sample and take its k nearest neighbors \( x_{i(near)} \), \( near \in \left\{ {1, \ldots ,k} \right\} \).
- Randomly choose a sample \( x_{i(nn)} \) from the k neighbors and synthesize a new sample by the following formula.
$$ x_{i1} = x_{i} + \zeta \cdot (x_{i(nn)} - x_{i} ),\quad \zeta \in (0,1) $$ (4)
- Repeat this N times to generate N new samples \( x_{inew} \), \( new \in \left\{ {1, \ldots ,N} \right\} \), from sample \( x_{i} \).
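The three steps above can be sketched in NumPy as follows. This is an illustrative implementation of formula (4), not the code used in the experiment; the function name `smote`, the default k and the toy data are assumptions. It needs at least two minority samples.

```python
import numpy as np


def smote(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic samples per minority sample by interpolation."""
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)
    k = min(k, len(minority) - 1)  # cannot use more neighbors than other samples
    synthetic = []
    for i, x_i in enumerate(minority):
        # Step 1: Euclidean distance from x_i to every minority sample.
        dist = np.linalg.norm(minority - x_i, axis=1)
        dist[i] = np.inf  # exclude x_i itself
        neighbors = np.argsort(dist)[:k]
        # Steps 2 and 3: repeat the interpolation n_new times.
        for _ in range(n_new):
            x_nn = minority[rng.choice(neighbors)]
            zeta = rng.uniform(0.0, 1.0)
            synthetic.append(x_i + zeta * (x_nn - x_i))
    return np.array(synthetic)


# Toy minority class: the four corners of the unit square.
faults = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
new_faults = smote(faults, n_new=3, k=2)
```

Because every synthetic point lies on a segment between two existing minority samples, the generated data stay inside the region already covered by the minority class.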
3 Experimental Methodology
This section describes the experimental methodology. As shown in the flow chart of Fig. 3, the main idea of the experiment is to balance the software fault data with VAE, GAN and SMOTE, each of which generates synthetic fault samples to increase the number of minority samples. The new fault samples generated by each method are added to the original software fault data separately, so that the number of fault samples approaches the number of non-fault samples. The flow chart is explained in further detail below.
First, the data are processed, which includes deleting missing data and normalization. The VAE and GAN models are then trained to generate new fault samples, and SMOTE is also used to generate synthetic fault samples. The samples generated by VAE, GAN and SMOTE are added to the original fault data (minority) separately. These methods are called “VAE”, “GAN” and “SMOTE” in the flow chart of Fig. 3. Four classifiers, namely RF (Random Forest), SVM (Support Vector Machine), LR (Logistic Regression) and DT (Decision Tree), are used to obtain the software fault prediction results. AUC, MCC, recall and F1-measure are selected to evaluate the classifiers. The performance of VAE, GAN and SMOTE is compared, and the results of VAE and GAN are also compared with the results of paper [5]. The experiment is run on a TITAN X GPU. The runtime environment is as follows:
- Python 3.6
- Keras 2.1.6
- TensorFlow 1.4.1
3.1 Data Processing
Data processing is the prerequisite of the experiment. After the missing data are deleted, the data are normalized. Normalization brings features with different scales into the same range. Min-Max scaling and Z-score are classical normalization techniques, and Min-Max scaling is commonly used with neural networks. In this paper, VAE and GAN are built on MLPs, so we choose Min-Max scaling for normalization. The Min-Max transformation is given by formula (5).
There are several ways to deal with missing data, for example replacing them with 0, 1 or the mean of the feature. Here, in order to reduce uncertainty, instances with missing values are deleted. This is done with Pandas.
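The two processing steps can be sketched with Pandas as below. The sketch assumes the usual Min-Max transform \( x^{\prime} = (x - x_{\min } )/(x_{\max } - x_{\min } ) \) applied per feature; the function name `preprocess`, the column names and the toy values are hypothetical and do not come from the datasets.

```python
import numpy as np
import pandas as pd


def preprocess(df, label_col):
    """Drop rows with missing values, then min-max scale each feature to [0, 1]."""
    clean = df.dropna(axis=0).reset_index(drop=True)
    features = clean.drop(columns=[label_col])
    f_min, f_max = features.min(), features.max()
    span = (f_max - f_min).replace(0, 1)  # constant columns map to 0, not NaN
    scaled = (features - f_min) / span
    scaled[label_col] = clean[label_col]  # the label is left unscaled
    return scaled


# Hypothetical rows with one missing value in the first feature.
raw = pd.DataFrame({
    "loc": [10.0, 50.0, np.nan, 30.0],
    "v(g)": [1.0, 5.0, 2.0, 3.0],
    "defects": [0, 1, 0, 0],
})
processed = preprocess(raw, label_col="defects")
```

Here the third row is dropped because of its missing value, and the remaining feature values are rescaled column by column.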
3.2 The Design of VAE
As can be seen from Table 1, the VAE is built as an MLP (multilayer perceptron). The input dimension of the encoder is the number of code attributes of the software fault data, which is 21 in this experiment. The number of neurons in the hidden layer is set to 100; during training, we found that the loss decreased rapidly with this setting. The output dimension of the encoder is 2, and the output is the mean and logarithmic variance of the latent variable. In the decoder (generator), the first dense layer also has 100 neurons, and the output is a simulated fault instance with dimension 21. RMSProp is selected as the optimizer for the VAE model. The loss function of the model is the KL divergence.
3.3 The Design of GAN
The most difficult part of the experiment is training the GAN model. MLPs are used to build the generator and the discriminator. After several failed attempts, a model that works for software fault data was obtained. The architecture of GAN is shown in Table 2. The loss curves of the generator and the discriminator behave as expected and are shown in Fig. 4. Adam is selected as the optimizer for the GAN model. The Leaky ReLU slope is set to 0.2.
3.4 Evaluation
In the experiment, AUC, MCC, recall and F1-measure are selected to evaluate the results. AUC is the area under the receiver operating characteristic (ROC) curve.
- MCC: Matthews correlation coefficient;
- F1-measure: the harmonic mean of recall and precision;
- TP: True Positives; FP: False Positives;
- TN: True Negatives; FN: False Negatives.
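The count-based measures listed above can be computed directly from TP, FP, TN and FN. The sketch below uses the standard definitions of recall, precision, F1-measure and MCC; the function name `classification_metrics` and the example counts are hypothetical and are not results from the experiment.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Recall, precision, F1 and MCC from confusion-matrix counts."""
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom_f1 = precision + recall
    f1 = 2 * precision * recall / denom_f1 if denom_f1 else 0.0
    denom_mcc = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom_mcc if denom_mcc else 0.0
    return {"recall": recall, "precision": precision, "f1": f1, "mcc": mcc}


# Hypothetical counts, not taken from the paper's experiments.
scores = classification_metrics(tp=40, fp=10, tn=45, fn=5)
```

Unlike recall and F1-measure, MCC uses all four counts, which makes it informative when the two classes differ greatly in size.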
4 The Results of Experiment
In the experiment, three datasets, JM1, KC1 and KC2, are selected from the Promise Repository. The datasets are public and can be downloaded from the Internet [25]. They record software faults in NASA spaceflight projects. Each dataset has 21 code attributes, which include the McCabe, Halstead, LOC and miscellaneous metrics listed in Table 3.
The numbers of normal and anomaly instances used by VAE, GAN and SMOTE are given in Table 4. Anomaly instances represent fault data, which belong to the minority class. We focus on the fault prediction performance of the different methods and compare AUC, MCC, recall and F1-measure. For each of the three datasets, 90% of the data are used as the training set and 10% as the test set. The results of AUC and MCC for the three methods, VAE, GAN and SMOTE, are shown in Table 5.
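The balancing target and the 90/10 split can be sketched as follows. The text does not state whether the split is stratified or whether synthetic samples are added before or after splitting, so the sketch leaves both choices open. The function names `n_synthetic_needed` and `split_90_10` and the class counts are hypothetical.

```python
import numpy as np


def n_synthetic_needed(labels):
    """Number of fault samples to generate so both classes have equal size."""
    labels = np.asarray(labels)
    n_fault = int(np.sum(labels == 1))
    n_normal = int(np.sum(labels == 0))
    return max(n_normal - n_fault, 0)


def split_90_10(n_samples, seed=0):
    """Shuffle indices and return (train_idx, test_idx) with a 90/10 split."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    n_test = int(round(0.1 * n_samples))
    return idx[n_test:], idx[:n_test]


labels = np.array([0] * 80 + [1] * 20)  # hypothetical class counts
to_generate = n_synthetic_needed(labels)
train_idx, test_idx = split_90_10(len(labels))
```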
For JM1, VAE has the best AUC and MCC among the three methods, obtained with the Random Forest classifier: 0.92 and 0.78 respectively. The best AUC and MCC of GAN are 0.92 and 0.77, also with Random Forest, which are higher than the best AUC and MCC of SMOTE, 0.89 and 0.73 respectively.
For KC1, VAE has the best AUC and MCC among the three methods, obtained with the Random Forest classifier: 0.94 and 0.81 respectively. The best AUC and MCC of GAN are 0.87 and 0.65 with the SVM classifier, which are lower than the best AUC and MCC of SMOTE, 0.88 and 0.69 respectively.
For KC2, VAE has the best AUC and MCC among the three methods, obtained with the Logistic Regression classifier: 0.94 and 0.83 respectively. The best AUC and MCC of GAN are 0.93 and 0.78 with the Random Forest classifier, which are higher than the best AUC and MCC of SMOTE, 0.92 and 0.76 respectively.
Fig. 5 compares the best average recall of VAE, GAN and SMOTE on the three datasets. For JM1, the recall of VAE is 0.89, the highest of the three methods; the recall of GAN is 0.88, higher than that of SMOTE (0.86). For KC1, the recall of VAE is 0.90, again the highest; the recall of GAN is 0.83, lower than that of SMOTE (0.85). For KC2, the recall of VAE is 0.92, the highest of the three methods; the recall of GAN is 0.89, higher than that of SMOTE (0.88).
Fig. 6 compares the best average F1-measure of VAE, GAN and SMOTE on the three datasets. The pattern is the same as for recall: VAE has the highest F1-measure of the three methods, and GAN has a better F1-measure than SMOTE on JM1 and KC2.
AUC results for two of the datasets, JM1 and KC1, are reported in paper [5]. We compare the best AUC in paper [5] with the best AUC of VAE and GAN in this experiment; the comparison is shown in Table 6. VAE and GAN both achieve a higher AUC than the best results in paper [5].
From these comparisons, we find that VAE performs better than GAN and SMOTE on all three datasets, JM1, KC1 and KC2, and that GAN performs better than SMOTE on JM1 and KC2. The best AUC of VAE and GAN exceeds the best AUC reported in paper [5]. VAE and GAN are therefore useful methods for software fault prediction. GAN usually generates image samples better than VAE, but in this experiment VAE outperforms GAN at generating software fault samples.
5 Conclusion
In this paper, we apply the deep learning techniques VAE and GAN to software fault prediction and compare their performance. The architectures of VAE and GAN are designed to fit the multivariable data of software faults. The ability of VAE and GAN to generate new fault samples is used to balance the normal and anomaly classes. An experiment is carried out to verify the proposed scheme. Three typical datasets, JM1, KC1 and KC2, from NASA spaceflight software projects are selected, and four classifiers are used. We find that the VAE scheme performs better than the GAN and SMOTE schemes for software fault prediction. Although GAN usually generates images better than VAE, it does not outperform VAE at generating software fault data in this experiment. GAN performs better than SMOTE on JM1 and KC2. Compared with the results in paper [5], VAE and GAN achieve a better AUC. It can be inferred that applying the deep learning techniques VAE and GAN to software fault prediction is practicable, and that VAE performs better than GAN.
References
Sharma, D., Chandra, P.: Software fault prediction using machine-learning techniques. Smart Comput. Inform. 78, 541–549 (2018)
Curtis, B.: Measuring the psychological complexity of software maintenance tasks with the Halstead and McCabe metrics. IEEE Trans. Softw. Eng. SE-5(2), 96–104 (1979)
Yahya, N., Bakar, N.S.A.A.: McCabe’s complexity and CK metrics on the internal quality of test first implementation in Malaysian education settings. Adv. Sci. Lett. 24(2), 1201–1205 (2018)
Bailey, C.T., Dingee, W.L.: A software study using Halstead metrics. In: ACM Workshop/symposium on Measurement and Evaluation of Software Quality, pp. 189–197 (1981)
Lessmann, S.: Benchmarking classification models for software defect prediction: a proposed framework and novel findings. IEEE Trans. Softw. Eng. 34(4), 485–496 (2008)
Zhang, P., Chang, Y.T.: Software fault prediction based on grey neural network (2012)
Kanmani, S.: Object-oriented software fault prediction using neural networks. Inf. Softw. Technol. 49(5), 483–492 (2007)
Shanthini, A., Vinodhini, G., Chandrasekaran, R.M.: Bagged SVM classifier for software fault prediction. Int. J. Comput. Appl. 62(15), 21–24 (2013)
Ibrahim, D.R., Ghnemat, R., Hudaib, A.: Software defect prediction using feature selection and random forest algorithm. In: International Conference on New Trends in Computing Sciences (2017)
He, H., Garcia, E.A.: Learning from imbalanced data. IEEE Trans. Knowl. Data Eng. 21(9), 1263–1284 (2009)
Donoho, D.L., Tanner, J.: Precise undersampling theorems. Proc. IEEE 98(6), 913–924 (2010)
Last, F., Douzas, G., Bacao, F.: Oversampling for imbalanced learning based on K-Means and SMOTE (2017)
Chawla, N.V., et al.: SMOTE: synthetic minority over-sampling technique. J. Artif. Intell. Res. 16(1), 321–357 (2002)
LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521(7553), 436 (2015)
Walker, J., Doersch, C., Gupta, A., Hebert, M.: An uncertain future: forecasting from static images using variational autoencoders. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9911, pp. 835–851. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46478-7_51
Simonovsky, M., Komodakis, N.: GraphVAE: towards generation of small graphs using variational autoencoders. In: Kůrková, V., Manolopoulos, Y., Hammer, B., Iliadis, L., Maglogiannis, I. (eds.) ICANN 2018. LNCS, vol. 11139, pp. 412–422. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01418-6_41
Semeniuta, S., Severyn, A., Barth, E.: A hybrid convolutional variational autoencoder for text generation (2017)
Ding, Z., et al.: TGAN: deep tensor generative adversarial nets for large image generation (2019)
Gurumurthy, S., Sarvadevabhatla, R.K., Babu, R.V.: DeLiGAN: generative adversarial networks for diverse and limited data. In: IEEE Conference on Computer Vision and Pattern Recognition. IEEE Computer Society (2017)
Kingma, D.P., Welling, M.: Auto-encoding variational bayes. In: Conference Proceedings: Papers Accepted to the International Conference on Learning Representations. arXiv.org (2014)
Goodfellow, I.J., et al.: Generative adversarial nets. In: International Conference on Neural Information Processing Systems (2014)
Sun, Y., Xu, L., Li, Y., et al.: Utilizing deep architecture networks of VAE in software fault prediction. In: 2018 IEEE International Conference on Parallel and Distributed Processing with Applications, Ubiquitous Computing and Communications, Big Data and Cloud Computing, Social Computing and Networking, Sustainable Computing and Communications (ISPA/IUCC/BDCloud/SocialCom/SustainCom), pp. 870–877. IEEE (2018)
Lesort, T., Stoian, A., Goudou, J.-F., Filliat, D.: Training discriminative models to evaluate generative ones. In: Tetko, I.V., Kůrková, V., Karpov, P., Theis, F. (eds.) ICANN 2019. LNCS, vol. 11729, pp. 604–619. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-30508-6_48
SOHU. https://www.sohu.com/a/226209674_500659. Accessed 25 June 2019
Promise homepage. http://promise.site.uottawa.ca/SERepository/datasets-page.html. Accessed 25 June 2019
Acknowledgement
This work is supported by the National Natural Science Foundation of China (No. 61901454) and the Foundation of the Key Laboratory of Space Utilization, Technology and Engineering Center for Space Utilization, Chinese Academy of Sciences (No. CSU-QZKT-2018-08).
© 2020 Springer Nature Switzerland AG
Sun, Y., Xu, L., Guo, L., Li, Y., Wang, Y. (2020). A Comparison Study of VAE and GAN for Software Fault Prediction. In: Wen, S., Zomaya, A., Yang, L.T. (eds) Algorithms and Architectures for Parallel Processing. ICA3PP 2019. Lecture Notes in Computer Science(), vol 11945. Springer, Cham. https://doi.org/10.1007/978-3-030-38961-1_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-38960-4
Online ISBN: 978-3-030-38961-1