Abstract
Software fault prediction (SFP) is the process of identifying potentially defect-prone modules before the testing stage of software development. By identifying faults early, software engineers can focus their effort on the components most likely to contain defects, thereby improving the overall quality and reliability of the software. However, data imbalance and feature redundancy are challenging issues in SFP that can degrade the performance of fault prediction models. Imbalanced software fault datasets, in which normal modules (the majority class) significantly outnumber faulty modules (the minority class), may lead to many false negatives. In this work, we empirically assess variants of Generative Adversarial Networks (GANs), an emerging synthetic data generation method, for resolving the data imbalance issue in common software fault prediction datasets. Five GAN variants (CopulaGAN, VanillaGAN, CTGAN, TGAN, and WGANGP) are used to generate synthetic faulty samples that balance the proportion of majority and minority classes in the datasets. We then present an extensive evaluation of prediction models that combine GAN-based oversampling either with Recursive Feature Elimination (RFE) for feature selection or with Autoencoders for feature extraction. In experiments on five fault datasets extracted from the PROMISE repository, we evaluate six machine learning approaches using precision, recall, F1-score, Area Under the Curve (AUC), and Matthews Correlation Coefficient (MCC) as performance metrics. The experimental results demonstrate that CTGAN combined with RFE and CTGAN paired with Autoencoders outperform the other baselines on all datasets, followed by WGANGP and VanillaGAN. According to the comparative analysis, GAN-based oversampling methods yield a significant improvement in handling data imbalance for software fault prediction.
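To make the imbalance problem and the threshold-based metrics named in the abstract concrete, the following is a minimal sketch in plain Python. It is not the authors' implementation: the function names, the toy labels, and the 1:1 balancing target are assumptions chosen for illustration. AUC is omitted because it requires ranked prediction scores rather than hard labels.

```python
import math


def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN for binary labels (1 = faulty, 0 = normal)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def safe_div(numerator, denominator):
    """Return 0.0 instead of raising when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, F1-score and MCC from hard labels."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    f1 = safe_div(2 * precision * recall, precision + recall)
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = safe_div(tp * tn - fp * fn, mcc_denominator)
    accuracy = safe_div(tp + tn, tp + fp + tn + fn)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "mcc": mcc,
    }


def synthetic_samples_needed(n_majority, n_minority):
    """Synthetic minority samples required to reach a 1:1 class ratio (assumed target)."""
    return max(0, n_majority - n_minority)


# Hypothetical toy dataset: 90 normal modules and 10 faulty ones.
y_true = [0] * 90 + [1] * 10
# Hypothetical classifier that detects only 2 of the 10 faulty modules.
y_pred = [0] * 90 + [1] * 2 + [0] * 8

results = classification_metrics(y_true, y_pred)
extra = synthetic_samples_needed(n_majority=90, n_minority=10)

for name, value in results.items():
    print(f"{name}: {value:.3f}")
print(f"synthetic faulty samples needed for balance: {extra}")
```

In this toy case accuracy looks high while recall stays low, because the eight missed faulty modules are false negatives. This illustrates why accuracy alone can be misleading on imbalanced fault datasets and why metrics such as recall, F1-score, and MCC are reported.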
Availability of Data and Materials
The source code of these projects from Apache is available at https://github.com/ApoorvaKrisna/NASA-promise-dataset-repository?tab=readme-ov-file
Code Availability
The source code is available on GitHub at https://github.com/htmphuong/GANPaper/tree/main
Acknowledgements
This research is funded by Funds for Science and Technology Development of the University of Danang under project number B2022-DN07-02.
Author information
Contributions
All authors discussed the results and contributed to the final manuscript.
Ethics declarations
Conflict of Interest/Competing Interests
There are no conflicts of interest regarding the publication of this paper.
Ethics Approval
Not Applicable
Consent to Participate
Not Applicable
Consent for Publication
I hereby provide consent for the publication of the manuscript.
About this article
Cite this article
Thi Minh Phuong, H., Vu Thu Nguyet, P., Huu Nhat Minh, N. et al. A comparative study of handling imbalanced data using generative adversarial networks for machine learning based software fault prediction. Appl Intell 55, 280 (2025). https://doi.org/10.1007/s10489-024-05930-z
DOI: https://doi.org/10.1007/s10489-024-05930-z