Abstract
Software fault prediction (SFP) is an effective means of reducing the testing resources a project must allocate and of improving software reliability. In practice, collecting adequate historical training data for a newly developed project can be difficult; in such cases, cross-project fault prediction (CPFP) is useful. Prior studies have proposed transfer learning and training data selection models for CPFP. However, existing models are sensitive to the choice of source projects used to train the prediction model. In addition, imbalanced project data and irrelevant features remain open issues in CPFP. To address these limitations, we propose WPSTC, a novel optimized source data selection model for CPFP that combines Wilcoxon signed-rank test-based source project selection (WPS) with an optimized training data construction (optimizedTC) technique. We evaluate WPSTC on 31 datasets using seven performance measures and compare it with existing fault prediction models across five conventional machine learning models and one ensemble model. On average, WPSTC outperforms the compared CPFP models, reduces their sensitivity to the selected source projects, and addresses the class imbalance and curse-of-dimensionality issues of CPFP models.










Data availability
The datasets analysed during the current study are available in the CPFP repository, https://github.com/pravaliManchala/CPFP.
Funding
No funding was obtained for this study.
Ethics declarations
Conflict of interest
The authors declared that they have no conflict of interest.
Ethical approval
This article does not contain any studies with human participants or animals performed by any of the authors.
About this article
Cite this article
Manchala, P., Bisi, M. A novel source project and optimized training data selection approach for cross-project fault prediction. J Supercomput 81, 316 (2025). https://doi.org/10.1007/s11227-024-06750-1
DOI: https://doi.org/10.1007/s11227-024-06750-1