A novel source project and optimized training data selection approach for cross-project fault prediction

The Journal of Supercomputing

Abstract

Software fault prediction (SFP) is a useful tool for reducing the testing resources a project must allocate and for enhancing software reliability. In practice, collecting adequate historical training data for a new project can be challenging; in such cases, cross-project fault prediction (CPFP) is useful. Prior studies have demonstrated transfer learning and training data selection models for CPFP. However, existing models are sensitive to the choice of source projects used to train the prediction model. In addition, imbalanced project data and irrelevant features remain issues in CPFP. To address these limitations, we propose WPSTC, a novel optimized source data selection model for CPFP that combines Wilcoxon signed-rank test-based source project selection (WPS) with an optimized training data construction (optimizedTC) technique. We evaluate WPSTC on 31 datasets with seven performance measures and compare it with existing fault prediction models using five conventional machine learning models and one ensemble model. On average, WPSTC outperforms existing CPFP models, reduces their sensitivity to the selected source projects, and addresses the imbalance and curse of dimensionality issues of CPFP models.
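The abstract does not describe how WPS applies the Wilcoxon signed-rank test, so the following is only a minimal sketch of one plausible reading, not the authors' algorithm. It assumes each project is a list of rows of software metrics, summarizes each project by its per-metric medians, and keeps a candidate source project when the test finds no significant difference from the target. The names `signed_rank_test` and `select_sources`, the median summary, the normal approximation, and the 0.05 threshold are all illustrative assumptions.

```python
import math
from statistics import median


def signed_rank_test(x, y):
    """Two-sided Wilcoxon signed-rank test on paired samples.

    Uses the normal approximation without tie or continuity correction
    (a simplification). Returns (W, p) with W = min(W+, W-).
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]  # zero differences are dropped
    n = len(diffs)
    if n == 0:
        return 0.0, 1.0  # identical samples: no evidence of a difference
    # Rank absolute differences; tied values share their average rank.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    w = min(w_plus, w_minus)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w - mean) / sd
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return w, p


def select_sources(target, candidates, alpha=0.05):
    """Keep candidate projects whose metric profile does not differ
    significantly from the target's (hypothetical selection rule)."""

    def profile(rows):
        # One median per metric column.
        return [median(col) for col in zip(*rows)]

    target_profile = profile(target)
    kept = []
    for name, rows in candidates.items():
        _, p = signed_rank_test(profile(rows), target_profile)
        if p >= alpha:
            kept.append((name, p))
    # Most similar (largest p-value) first.
    return sorted(kept, key=lambda item: -item[1])
```

Calling `select_sources(target_rows, {"projA": rows_a, "projB": rows_b})` returns the retained project names with their p-values. With only a handful of metrics the normal approximation is rough, so an exact test, such as the one offered by `scipy.stats.wilcoxon`, would be the safer choice in practice.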

Data availability

The datasets analysed during the current study are available in the CPFP repository, https://github.com/pravaliManchala/CPFP.

Funding

No funding was obtained for this study.

Author information

Corresponding author

Correspondence to Pravali Manchala.

Ethics declarations

Conflict of interest

The authors declared that they have no conflict of interest.

Ethical approval

This article does not contain any studies with human participants or animals performed by any of the authors.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Manchala, P., Bisi, M. A novel source project and optimized training data selection approach for cross-project fault prediction. J Supercomput 81, 316 (2025). https://doi.org/10.1007/s11227-024-06750-1

  • DOI: https://doi.org/10.1007/s11227-024-06750-1
