DOI: 10.1145/2896839.2896843
research-article

Defect prediction on a legacy industrial software: a case study on software with few defects

Published: 14 May 2016

Abstract

Context: Defect prediction models help reduce the effort needed to locate defects in software projects. In this paper, we share our experience building a defect prediction model for a large industrial software project. We extract product and process metrics to build the models and show that an accurate defect prediction model can be built even when only 4% of the software is defective.
Objective: Our goal is to integrate a defect predictor into the continuous integration (CI) cycle of a large software project and thereby decrease testing effort.
Method: We present our approach as an experience report. Specifically, we collected data from seven older versions of the software project and used additional features to predict defects in current versions. We compared several classification techniques, including Naive Bayes, Decision Trees, and Random Forest, and resampled our training data to present the company with the most accurate defect predictor.
Results: Our results indicate that testing effort can be focused by guiding the test team to only 8% of the software, where 53% of the actual defects can be found. Our model has 90% accuracy.
Conclusion: We produce a highly accurate defect prediction model for software with a defect rate of 4%. Our model uses Random Forest, which we show has more predictive power than Naive Bayes, Logistic Regression, and Decision Trees in our case.
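
A minimal sketch of how the reported figures can be read, not the authors' evaluation code. With a 4% defect rate, a predictor that never flags anything is already 96% accurate, so the share of defects found per share of software inspected is the more telling number. The `Confusion` class, the `lift` helper, and the 1000-module release are illustrative assumptions. Only the 4%, 8%, and 53% values come from the abstract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """Counts from comparing predicted labels with actual defect labels."""
    tp: int  # defective modules flagged as defective
    fp: int  # clean modules flagged as defective
    fn: int  # defective modules that were missed
    tn: int  # clean modules correctly left unflagged

    @property
    def total(self) -> int:
        return self.tp + self.fp + self.fn + self.tn

    def accuracy(self) -> float:
        """Share of all modules that were labelled correctly."""
        return (self.tp + self.tn) / self.total

    def recall(self) -> float:
        """Share of actual defective modules that were flagged."""
        actual = self.tp + self.fn
        return self.tp / actual if actual else 0.0

    def flagged_fraction(self) -> float:
        """Share of the software the test team is asked to inspect."""
        return (self.tp + self.fp) / self.total


def lift(recall: float, flagged_fraction: float) -> float:
    """Defects found relative to inspecting the same share of modules at random."""
    return recall / flagged_fraction


# Hypothetical release: 1000 modules, 4% (40) of them defective.
# A predictor that flags nothing looks accurate but finds no defects.
always_clean = Confusion(tp=0, fp=0, fn=40, tn=960)
print(always_clean.accuracy())  # 0.96
print(always_clean.recall())    # 0.0

# Figures from the abstract: 53% of defects found in 8% of the software.
print(round(lift(0.53, 0.08), 3))  # 6.625
```

Read this way, inspecting the flagged 8% finds roughly 6.6 times as many defects as inspecting a random 8% would be expected to find.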

Published In

CESI '16: Proceedings of the 4th International Workshop on Conducting Empirical Studies in Industry
May 2016
48 pages
ISBN: 9781450341547
DOI: 10.1145/2896839

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. defect prediction
  2. experience report
  3. feature selection
  4. process metrics
  5. random forest

Conference

ICSE '16

Cited By

  • (2024) Drift Detection in Legacy Systems Using Machine Learning Techniques. 2024 3rd International Conference for Innovation in Technology (INOCON), pp. 1-6. DOI: 10.1109/INOCON60754.2024.10512292. Online publication date: 1-Mar-2024
  • (2024) Systematic Literature Review on Application of Learning-Based Approaches in Continuous Integration. IEEE Access, 12, pp. 135419-135450. DOI: 10.1109/ACCESS.2024.3424276. Online publication date: 2024
  • (2023) Commit-Based Class-Level Defect Prediction for Python Projects. IEICE Transactions on Information and Systems, E106.D:2, pp. 157-165. DOI: 10.1587/transinf.2022MPP0003. Online publication date: 1-Feb-2023
  • (2023) Prediction of Defective Artifacts by Removing Redundant Metrics in Software Development Life Cycle (SDLC). 2023 6th International Conference on Contemporary Computing and Informatics (IC3I), pp. 1914-1917. DOI: 10.1109/IC3I59117.2023.10397955. Online publication date: 14-Sep-2023
  • (2022) Software Defect Prediction: State of the Art Survey. International Journal of Innovative Technology and Exploring Engineering, 11:7, pp. 32-35. DOI: 10.35940/ijitee.G9993.0611722. Online publication date: 30-Jun-2022
  • (2021) Software Project Management Using Machine Learning Technique—A Review. Applied Sciences, 11:11, article 5183. DOI: 10.3390/app11115183. Online publication date: 2-Jun-2021
  • (2021) Software Testing Effort Estimation and Related Problems. ACM Computing Surveys, 54:3, pp. 1-38. DOI: 10.1145/3442694. Online publication date: 17-Apr-2021
  • (2021) External Threat Detection in Smart Sensor Networks Using Machine Learning Approach. Smart Sensor Networks, pp. 151-178. DOI: 10.1007/978-3-030-77214-7_7. Online publication date: 2-Sep-2021
  • (2020) Machine Learning Based Bug Prediction Engine For Smart Contracts. 2020 Turkish National Software Engineering Symposium (UYMS), pp. 1-6. DOI: 10.1109/UYMS50627.2020.9247056. Online publication date: 7-Oct-2020
  • (2020) Design and Development of Machine Learning Technique for Software Project Risk Assessment - A Review. 2020 8th International Conference on Information Technology and Multimedia (ICIMU), pp. 354-362. DOI: 10.1109/ICIMU49871.2020.9243459. Online publication date: 24-Aug-2020
