A novel source project and optimized training data selection approach for cross-project fault prediction

Manchala, Pravali; Bisi, Manjubala

doi:10.1007/s11227-024-06750-1

Published: 20 December 2024

Volume 81, article number 316 (2025)

The Journal of Supercomputing

Pravali Manchala¹ & Manjubala Bisi¹

Abstract

Software fault prediction (SFP) helps limit the allocation of software testing resources and enhance software reliability. In practice, collecting adequate historical training data for a newly developed project can be challenging; in such cases, cross-project fault prediction (CPFP) is useful. Prior studies have demonstrated transfer learning and training data selection models for CPFP. However, existing models are sensitive to the choice of source projects used to train the prediction model. In addition, imbalanced projects and irrelevant features remain open issues in CPFP. To address these limitations, we propose WPSTC, a novel optimized source data selection model for CPFP that combines Wilcoxon signed-rank test-based source project selection (WPS) with an optimized training data construction (optimizedTC) technique. We evaluate WPSTC on 31 datasets with seven performance measures and compare it with existing fault prediction models using five conventional machine learning models and one ensemble model. On average, WPSTC outperforms existing CPFP models, reduces their sensitivity to the selected source projects, and addresses the class imbalance and curse-of-dimensionality issues of CPFP models.
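
The abstract names a Wilcoxon signed-rank test as the basis for source project selection but does not describe the procedure. The following is only a minimal sketch of how such a test could be used to filter candidate source projects, not the authors' WPS implementation. It assumes that every project shares the same metric columns, that each project is summarised by its per-feature medians, and that a source is kept when the test finds no significant difference from the target at a hypothetical threshold `alpha`. The function names `signed_rank_test`, `feature_medians` and `select_sources` are illustrative.

```python
import math
from statistics import median


def signed_rank_test(x, y):
    """Two-sided Wilcoxon signed-rank test using a normal approximation.

    Returns (W, p) with W = min(W+, W-). Zero differences are dropped and
    tied absolute differences share an average rank. No tie or continuity
    correction is applied, so p is only approximate for small samples.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        return 0.0, 1.0
    # Rank the absolute differences, averaging ranks within tied groups.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    w = min(w_plus, w_minus)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w - mean) / sd
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return w, p


def feature_medians(rows):
    """Summarise a project (list of metric rows) by its per-feature medians."""
    return [median(column) for column in zip(*rows)]


def select_sources(target_rows, candidates, alpha=0.05):
    """Keep candidate projects whose feature profile is not significantly
    different from the target's (assumed selection rule: p > alpha)."""
    target_profile = feature_medians(target_rows)
    selected = []
    for name, rows in candidates.items():
        _, p = signed_rank_test(feature_medians(rows), target_profile)
        if p > alpha:
            selected.append(name)
    return selected
```

With only a handful of features the normal approximation is rough; an exact test or a library routine such as SciPy's `scipy.stats.wilcoxon` would be the more careful choice in practice.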

Data availability

The datasets analysed during the current study are available in the CPFP repository, https://github.com/pravaliManchala/CPFP.

Funding

No funding was obtained for this study.

Author information

Authors and Affiliations

Computer Science and Engineering, NIT Warangal, Hanamkonda, 506004, India
Pravali Manchala & Manjubala Bisi

Corresponding author

Correspondence to Pravali Manchala.

Ethics declarations

Conflict of interest

The authors declared that they have no conflict of interest.

Ethical approval

This article does not contain any studies with human participants or animals performed by any of the authors.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary material

Below is the link to the electronic supplementary material.

Supplementary file 1 (XLSX 142 kb)

Source code

https://github.com/pravaliManchala/source-project-and-optimized-training-data-selection-approach-for-CPFP

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Manchala, P., Bisi, M. A novel source project and optimized training data selection approach for cross-project fault prediction. J Supercomput 81, 316 (2025). https://doi.org/10.1007/s11227-024-06750-1

Accepted: 20 November 2024
Published: 20 December 2024
DOI: https://doi.org/10.1007/s11227-024-06750-1
