DOI: 10.1145/2972958.2972964

Search Based Training Data Selection For Cross Project Defect Prediction

Published: 09 September 2016

Abstract

Context: Previous studies have shown that steered training data or dataset selection can lead to better performance for cross project defect prediction (CPDP). On the other hand, data quality is an issue to consider in CPDP.
Aim: We aim at utilising the Nearest Neighbor (NN)-Filter, embedded in a genetic algorithm, for generating evolving training datasets to tackle CPDP, while accounting for potential noise in defect labels.
Method: We propose a new search based training data (i.e., instance) selection approach for CPDP called GIS (Genetic Instance Selection) that searches for solutions optimizing a combined measure of F-Measure and GMean on a validation set generated by the NN-Filter. The genetic operations consider similarities in features and address possible noise in the assigned defect labels. We use 13 datasets from the PROMISE repository to compare the performance of GIS with benchmark CPDP methods, namely the NN-Filter and naive CPDP, as well as with within project defect prediction (WPDP).
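To make the search based selection idea concrete, the following minimal sketch illustrates a genetic algorithm that evolves binary instance-selection masks over the cross project training data and scores each candidate with a combined F-Measure and GMean fitness on a validation set. It is an illustration only, not the authors' implementation: it assumes NumPy arrays and a scikit-learn Naive Bayes learner, uses generic uniform crossover and bit-flip mutation in place of GIS's similarity- and label-noise-aware genetic operations, and the names gis_select, fitness and gmean are hypothetical. The validation pair (X_val, y_val) stands in for the dataset produced by the NN-Filter.

# Hypothetical sketch of GA-based training instance selection for CPDP.
# Assumes X_cross, y_cross (candidate cross project instances) and
# X_val, y_val (NN-filtered validation set) are NumPy arrays with labels in {0, 1}.
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import f1_score, recall_score

def gmean(y_true, y_pred):
    # Geometric mean of recall on the defective (1) and non-defective (0) classes.
    return np.sqrt(recall_score(y_true, y_pred, pos_label=1) *
                   recall_score(y_true, y_pred, pos_label=0))

def fitness(mask, X_cross, y_cross, X_val, y_val):
    # Train on the selected instances, score on the validation set.
    if mask.sum() < 2 or len(set(y_cross[mask])) < 2:
        return 0.0
    pred = GaussianNB().fit(X_cross[mask], y_cross[mask]).predict(X_val)
    return 0.5 * (f1_score(y_val, pred) + gmean(y_val, pred))  # combined objective

def gis_select(X_cross, y_cross, X_val, y_val, pop=30, gens=50, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y_cross)
    population = rng.random((pop, n)) < 0.5                      # random selection masks
    for _ in range(gens):
        scores = [fitness(m, X_cross, y_cross, X_val, y_val) for m in population]
        parents = population[np.argsort(scores)[-pop // 2:]]     # keep the fitter half
        children = []
        for _ in range(pop - len(parents)):
            a, b = parents[rng.integers(len(parents), size=2)]
            child = np.where(rng.random(n) < 0.5, a, b)          # uniform crossover
            child ^= rng.random(n) < 0.01                        # bit-flip mutation
            children.append(child)
        population = np.vstack([parents, children])
    scores = [fitness(m, X_cross, y_cross, X_val, y_val) for m in population]
    return population[int(np.argmax(scores))]                    # best selection mask

The returned mask would then be used to train the final defect predictor, which is evaluated on the actual target project test set rather than on the validation set used during the search.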
Results: Our results show that GIS is significantly better than the NN-Filter in terms of F-Measure (p-value ≪ 0.001, Cohen's d = 0.697) and GMean (p-value ≪ 0.001, Cohen's d = 0.946). It also outperforms the naive CPDP approach in terms of F-Measure (p-value ≪ 0.001, Cohen's d = 0.753) and GMean (p-value ≪ 0.001, Cohen's d = 0.994). In addition, the performance of our approach is better than that of WPDP, again considering F-Measure (p-value ≪ 0.001, Cohen's d = 0.227) and GMean (p-value ≪ 0.001, Cohen's d = 0.595) values.
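For readers who want to reproduce this kind of comparison, the snippet below shows one way to compute a paired significance test and Cohen's d over per-dataset performance values. The abstract does not name the statistical test used, so the Wilcoxon signed-rank test here is only a stand-in, and the per-dataset F-Measure values are made up for illustration.

# Hedged example: paired significance test plus Cohen's d effect size.
# The numbers below are illustrative, not results from the paper.
import numpy as np
from scipy.stats import wilcoxon

def cohens_d(a, b):
    # Effect size based on the pooled standard deviation of the two samples.
    a, b = np.asarray(a, float), np.asarray(b, float)
    pooled = np.sqrt(((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1))
                     / (len(a) + len(b) - 2))
    return (a.mean() - b.mean()) / pooled

gis_f1 = np.array([0.62, 0.58, 0.71, 0.66, 0.60])   # hypothetical per-dataset F-Measures
nn_f1  = np.array([0.48, 0.51, 0.55, 0.50, 0.47])
stat, p = wilcoxon(gis_f1, nn_f1)                    # paired, non-parametric test
print(f"p-value = {p:.4f}, Cohen's d = {cohens_d(gis_f1, nn_f1):.3f}")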
Conclusions: We conclude that search based instance selection is a promising way to tackle CPDP. In particular, the performance comparison with the within project scenario encourages further investigation of our approach. However, the performance of GIS relies on high recall at the expense of low precision. Using different optimization goals, e.g. targeting high precision, would be a future direction to investigate.





Published In

cover image ACM Other conferences
PROMISE 2016: Proceedings of the 12th International Conference on Predictive Models and Data Analytics in Software Engineering
September 2016
84 pages
ISBN: 9781450347723
DOI: 10.1145/2972958

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. Cross Project Defect Prediction
  2. Genetic Algorithms
  3. Instance Selection
  4. Search Based Optimization
  5. Training Data Selection

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

PROMISE 2016

Acceptance Rates

PROMISE 2016 Paper Acceptance Rate: 10 of 23 submissions (43%)
Overall Acceptance Rate: 98 of 213 submissions (46%)

