Time variance and defect prediction in software projects

Towards an exploitation of periods of stability and change as well as a notion of concept drift in software projects

Abstract

It is crucial for a software manager to know whether or not one can rely on a bug prediction model. A wrong prediction of the number or the location of future bugs can lead to problems in the achievement of a project’s goals. In this paper we first verify the existence of variability in a bug prediction model’s accuracy over time, both visually and statistically. Furthermore, we explore the reasons for such high variability over time, which includes periods of stability and variability of prediction quality, and formulate a decision procedure for evaluating prediction models before applying them. To exemplify our findings we use data from four open source projects and empirically identify various project features that influence the defect prediction quality. Specifically, we observed that a change in the number of authors editing a file and the number of defects fixed by them influence the prediction quality. Finally, we introduce an approach to estimate the accuracy of prediction models that helps a project manager decide when to rely on a prediction model. Our findings suggest that one should be aware of the periods of stability and variability of prediction quality and should use approaches such as ours to assess their models’ accuracy in advance.

Notes

  1. http://www.nongnu.org/cvs/

  2. http://www.bugzilla.org/

  3. E.g., http://www.st.cs.uni-saarland.de/softevo/bug-data/eclipse/ or http://www.cs.columbia.edu/~eaddy/concerntagger/.

  4. Tables can be found in Appendix A.

  5. Note that a complete description can be found in Appendix B and that, for all features where authorship is relevant, it is determined as the person committing the code into CVS rather than the developer noted in the comments of the code. However, most active contributors are committers of a project. For example, in the PDT project (http://www.eclipse.org/pdt/people/contributors.php#Seva-%28Wsevolod%29-Lapsha), 11 out of 12 participants are committers. Hence, this assumption will not have a great impact on the outcome of the experiments.

  6. E.g., how the individual committers’ coding behavior synchronizes towards a milestone.

  7. Note that we used the Mann–Whitney test because the test for normality (one-sample Kolmogorov–Smirnov test: p = 0.055) produced a borderline result. As some still use the t-test for large collections of slightly non-normal data, we also ran an independent-samples t-test and found it to be significant at α = 0.001.

  8. As above, a t-test reconfirmed these findings at α = 0.001.

  9. More precisely, we used FixCache as BugCache is only the theoretical model behind the method. Nevertheless, BugCache is the often-used term for both methods.

  10. Note that the observed number of models (162) that pick random features is significantly different from the expected number of models (1,425) according to a χ²-test (p < 0.001).

  11. A complete set of the figures can be found online in the technical report by Ekanayake et al. (2011): http://www.ifi.uzh.ch/research/publications/technical-reports.html.

References

  • Ancona D, Chong CL (1996) Entrainment: pace, cycle, and rhythm in organizational behavior. In: Research in organizational behavior, vol 18. JAI Press, Greenwich, pp 251–284

  • Antoniol G, Ayari K, Di Penta M, Khomh F, Guéhéneuc YG (2008) Is it a bug or an enhancement?: a text-based approach to classify change requests. In: Proceedings of the 2008 conference of the Center for Advanced Studies on Collaborative Research (CASCON). ACM, New York, pp 304–318

  • Bachmann A, Bernstein A (2009) Data retrieval, processing and linking for software process data analysis. Tech. Rep. IFI-2009.0003, University of Zurich, Department of Informatics

  • Bernstein A, Ekanayake J, Pinzger M (2007) Improving defect prediction using temporal features and non linear models. In: IWPSE ’07: ninth international workshop on principles of software evolution, ACM, New York, pp 11–18. doi:10.1145/1294948.1294953

  • Bird C, Bachmann A, Aune E, Duffy J, Bernstein A, Filkov V, Devanbu P (2009) Fair and balanced?: bias in bug-fix datasets. In: Proceedings of the 7th joint meeting of the European software engineering conference and the ACM SIGSOFT symposium on the foundations of software engineering (ESEC/FSE). ACM, New York, pp 121–130

  • Brooks FP, Phillips F (1995) The mythical man-month: essays on software engineering. Addison-Wesley, Reading

  • Diehl S, Gall HC, Hassan AE (2009) Guest editors introduction: special issue on mining software repositories. Empir Software Eng 14(3):257–261

  • Eaddy M, Zimmermann T, Sherwood KD, Garg V, Murphy GC, Nagappan N, Aho AV (2008) Do crosscutting concerns cause defects? IEEE Trans Softw Eng 34(4):497–515

  • Ekanayake J, Tappolet J, Gall HC, Bernstein A (2011) Time variance and defect prediction in software projects—additional figures. Tech. Rep. IFI-2011.0004, University of Zurich, Department of Informatics

  • Fenton NE, Neil M (1999) A critique of software defect prediction models. IEEE Trans Softw Eng 25(5):675–689. doi:10.1109/32.815326

  • Graves TL, Karr AF, Marron JS, Siy H (2000) Predicting fault incidence using software change history. IEEE Trans Softw Eng 26(7):653–661. doi:10.1109/32.859533

  • Hassan AE (2009) Predicting faults using the complexity of code changes. In: ICSE ’09: Proceedings of the 31st international conference on software engineering. IEEE Computer Society, Washington, DC, pp 78–88. doi:10.1109/ICSE.2009.5070510

  • Hassan AE, Holt RC (2005) The top ten list: dynamic fault prediction. In: ICSM ’05: Proceedings of the 21st IEEE international conference on software maintenance. IEEE Computer Society, Washington, DC, pp 263–272. doi:10.1109/ICSM.2005.91

  • Kagdi H, Collard ML, Maletic JI (2007) A survey and taxonomy of approaches for mining software repositories in the context of software evolution. J Softw Maint Evol 19(2):77–131. doi:10.1002/smr.344

  • Kenmei B, Antoniol G, Di Penta M (2008) Trend analysis and issue prediction in large-scale open source systems. In: Proc 12th European conference on software maintenance and reengineering CSMR 2008. IEEE Computer Society, Los Alamitos, pp 73–82. http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=4493302

  • Khoshgoftaar TM, Allen EB, Goel N, Nandi A, McMullan J (1996) Detection of software modules with high debug code churn in a very large legacy system. In: ISSRE ’96: Proceedings of the the seventh international symposium on software reliability engineering. IEEE Computer Society, Washington, DC, p 364

  • Kim S, Zimmermann T, Whitehead Jr EJ, Zeller A (2007) Predicting faults from cached history. In: ICSE ’07: Proceedings of the 29th international conference on software engineering. IEEE Computer Society, Washington, DC, pp 489–498. doi:10.1109/ICSE.2007.66

  • Knab P, Pinzger M, Bernstein A (2006) Predicting defect densities in source code files with decision tree learners. In: MSR ’06: Proceedings of the 2006 international workshop on mining software repositories. ACM, New York, pp 119–125. doi:10.1145/1137983.1138012

  • Ko AJ, Chilana PK (2010) How power users help and hinder open bug reporting. In: CHI ’10: Proceedings of the 28th international conference on human factors in computing systems. ACM, Atlanta, pp 1665–1674

  • Lessmann S, Baesens B, Mues C, Pietsch S (2008) Benchmarking classification models for software defect prediction: a proposed framework and novel findings. IEEE Trans Softw Eng 34(4):485–496. doi:10.1109/TSE.2008.35

  • Li PL, Herbsleb J, Shaw M (2005) Forecasting field defect rates using a combined time-based and metrics-based approach: a case study of openbsd. In: ISSRE ’05: Proceedings of the 16th IEEE international symposium on software reliability engineering. IEEE Computer Society, Washington, DC, pp 193–202. doi:10.1109/ISSRE.2005.19

  • Mockus A, Votta LG (2000) Identifying reasons for software changes using historic databases. In: ICSM ’00: Proceedings of the international conference on software maintenance (ICSM’00). IEEE Computer Society, Washington, DC, p 120

  • Nagappan N, Ball T (2005) Static analysis tools as early indicators of pre-release defect density. In: ICSE ’05: Proceedings of the 27th international conference on software engineering. ACM, New York, NY, pp 580–586. doi:10.1145/1062455.1062558

  • Ostrand T, Weyuker E, Bell R (2005) Predicting the location and number of faults in large software systems. IEEE Trans Softw Eng 31(4):340–355

  • Provost F, Fawcett T (2001) Robust classification for imprecise environments. Mach Learn 42(3):203–231

  • Quinlan JR (1993) C4.5: programs for machine learning. Morgan Kaufmann, San Mateo

  • Tsymbal A (2004) The problem of concept drift: definitions and related work. Tech. rep., Department of Computer Science, Trinity College Dublin

  • Vorburger P, Bernstein A (2006) Entropy-based concept shift detection. In: ICDM ’06: Proceedings of the sixth international conference on data mining. IEEE Computer Society, Washington, DC, pp 1113–1118. doi:10.1109/ICDM.2006.66

  • Widmer G, Kubat M (1993) Effective learning in dynamic environments by explicit context tracking. In: ECML ’93: Proceedings of the European conference on machine learning. Springer, London, pp 227–243

  • Witten IH, Frank E (2005) Data mining: practical machine learning tools and techniques. Morgan Kaufmann, San Mateo

  • Zimmermann T, Premraj R, Zeller A (2007) Predicting defects for eclipse. In: PROMISE ’07: Proceedings of the third international workshop on predictor models in software engineering. IEEE Computer Society, Washington, DC, p 9. doi:10.1109/PROMISE.2007.10

Acknowledgements

We would like to thank the anonymous reviewers for their many detailed comments that helped us to significantly improve the paper.

Author information

Corresponding author

Correspondence to Jayalath Ekanayake.

Additional information

Editors: Jim Whitehead and Michael Godfrey

J. Ekanayake was partially supported by funding from the Sabaragamuwa University of Sri Lanka while J. Tappolet was partially supported by funding from the Swiss National Science Foundation award number 200021–112330.

Appendices

Appendix A: Component List

Tables 13, 14 and 15 list all the investigated components and the number of files that each component consists of.

Table 13 Eclipse: Investigated components and number of files
Table 14 Netbeans: Investigated components and number of files
Table 15 Mozilla: Investigated components and number of files

Appendix B: Detailed Feature Description

This section describes the features used in the paper and explains their computation and rationale.

revision   We consider a revision as a change made to a file for some reason. The feature revision represents the number of changes made to a file during training periods. Both Graves et al. (2000) and Khoshgoftaar et al. (1996) found that past changes are good defect indicators.

activityRate   This feature measures how often a file has been revised during the training periods and is computed by dividing the number of revisions during the training period by the length of the training period (in months). Hassan and Holt (2005) concluded that a high frequency of changes in a file is a good defect predictor.
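For illustration, here is a minimal Python sketch of how revision and activityRate could be computed from a file's revision dates. The data representation (a plain list of datetime objects per file) and the month arithmetic are our own simplifications, not the paper's tooling.

```python
from datetime import datetime

def revision_and_activity_rate(revision_dates, training_start, training_end):
    """Compute the features revision and activityRate for one file.

    revision_dates: list of datetime objects, one per committed revision
    training_start, training_end: bounds of the training period
    """
    in_period = [d for d in revision_dates if training_start <= d < training_end]
    revisions = len(in_period)
    # approximate length of the training period in months
    months = max(1, round((training_end - training_start).days / 30))
    return revisions, revisions / months

# hypothetical revision dates for one file
dates = [datetime(2005, 1, 10), datetime(2005, 2, 3), datetime(2005, 2, 20)]
print(revision_and_activity_rate(dates, datetime(2005, 1, 1), datetime(2005, 4, 1)))
```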

lineAdded, lineDeleted and totalLineOperations   Several studies showed that past changes are good defect predictors (Graves et al. 2000; Khoshgoftaar et al. 1996). Therefore, we further quantify the amount of change done by authors using the features lineAdded and lineDeleted, which describe the number of lines of code added and deleted during the training periods. Further, we capture the total amount of work done for a revision by adding those two features, resulting in the feature totalLineOperations.

grownPerMonth   This feature provides information about the growth rate of a project or file in the training periods. We compute the amount of growth using the total number of lines added and deleted during that time period: we subtract the total number of lines deleted from the total number of lines added and then average this value by dividing it by the length of the training period (in months). Therefore, this number can be either positive (representing growth) or negative (representing shrinkage). We introduced this feature to address issues that may arise from overly fast change.

lineOperationRRevision   This feature captures the average size of a revision in terms of the number of lines of code added and deleted. We simply add the total number of lines of code added and deleted during the training periods and divide that amount by the number of revisions during that period.
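The line-based features above reduce to a few sums and ratios over per-revision line counts. The following sketch illustrates one possible computation, assuming each revision is given as a (lines added, lines deleted) pair; this representation is hypothetical and not taken from the paper.

```python
def line_features(revisions, training_months):
    """Compute lineAdded, lineDeleted, totalLineOperations, grownPerMonth and
    lineOperationRRevision for one file.

    revisions: list of (lines_added, lines_deleted) pairs, one per revision
               in the training period (hypothetical representation)
    training_months: length of the training period in months
    """
    line_added = sum(a for a, _ in revisions)
    line_deleted = sum(d for _, d in revisions)
    total = line_added + line_deleted
    # net growth per month; may be negative when more lines were deleted than added
    grown_per_month = (line_added - line_deleted) / training_months
    # average size of a revision; guard against files with no revisions
    per_revision = total / len(revisions) if revisions else 0.0
    return {
        "lineAdded": line_added,
        "lineDeleted": line_deleted,
        "totalLineOperations": total,
        "grownPerMonth": grown_per_month,
        "lineOperationRRevision": per_revision,
    }

print(line_features([(120, 30), (15, 60), (5, 0)], training_months=3))
```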

chanceRevision and chanceBug   These two features provide the probability of having a revision or a bug in a file in the future. They mimic the award-winning BugCache approach (Kim et al. 2007), which proposes that more recently fixed files are more vulnerable to bugs. We model this probability using the formula \(1/2^i\), where i represents how far back (in months) the latest revision or bug occurred from the prediction time period. If the latest revision or bug occurrence is far from the prediction time period, then i is large and the overall probability of having a bug (or revision) in the near future is low.
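As a small illustration of the \(1/2^i\) decay, the sketch below computes the score from the month of the latest event and the month of the prediction period; the function name and the month indexing are our own, not the paper's.

```python
def chance_of_event(last_event_month, prediction_month):
    """Probability-like score that an event (revision or bug) recurs soon.

    Months are counted from the project start, so
    i = prediction_month - last_event_month is the age of the latest event.
    """
    i = prediction_month - last_event_month
    if i < 0:
        raise ValueError("the event must precede the prediction period")
    return 1.0 / (2 ** i)

# an event last month (i = 1) yields 0.5, one three months ago (i = 3) yields 0.125
print(chance_of_event(11, 12), chance_of_event(9, 12))
```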

blockerFixes, criticalFixes, majorFixes, minorFixes, normalFixes and trivialFixes   These six features report the number of different types of bugs fixed during the training periods. The bugs are categorized according to their severity: blocker, critical, major, minor, normal and trivial. The severity of fixed bugs can be found in the Bugzilla database. If a revision has a referenced or linked entry in the Bugzilla database and the severity of that entry is marked as one of the above categories, we consider the revision to be a bug-fixing activity. Further, if the bug-fixing revision date falls into the training period, we count one bug as fixed in the assigned category. Our intention in introducing these features is to uncover any correlation between the severity and defects.
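A sketch of how such severity-wise fix counts could be derived, assuming each revision record already carries the severity of its linked Bugzilla entry (or None when no bug report is linked); this record layout is hypothetical rather than the paper's schema.

```python
from collections import Counter
from datetime import date

SEVERITIES = {"blocker", "critical", "major", "minor", "normal", "trivial"}

def severity_fix_counts(revisions, training_start, training_end):
    """Count bug-fixing revisions per severity within the training period."""
    counts = Counter()
    for rev in revisions:
        severity = rev.get("bug_severity")  # None when no Bugzilla entry is linked
        if severity in SEVERITIES and training_start <= rev["date"] < training_end:
            counts[severity + "Fixes"] += 1
    return dict(counts)

# hypothetical revision records for one file
revs = [
    {"date": date(2005, 2, 1), "bug_severity": "major"},
    {"date": date(2005, 2, 15), "bug_severity": None},
    {"date": date(2005, 3, 1), "bug_severity": "major"},
]
print(severity_fix_counts(revs, date(2005, 1, 1), date(2005, 4, 1)))  # {'majorFixes': 2}
```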

enhancementFixes   This feature counts the number of revisions made for enhancements requested during the training period of the models. In the bug categorization process, we found that some requests are not for fixing bugs but for enhancements. Hence, we introduce the feature enhancementFixes, which counts such fixed enhancements.

blockerReported, criticalReported, majorReported, minorReported, normalReported and trivialReported   These six features provide information about the number of reported bugs in terms of severity. We introduce these features as not all bugs reported during a training period may be fixed within that period. Note that we consider the opening date and the reporting date to be the same. If an opening date falls into the training period, we count one bug as reported in the assigned category.

enhancementReported   This feature counts the number of enhancements reported during the training periods. The reporting date is determined as above.

p1-fixes, p2-fixes, p3-fixes, p4-fixes and p5-fixes   Each bug report is further categorized based on its priority, such that the highest and the lowest priority bugs are categorized as P1 and P5, respectively; the other priorities fall in between. These five features describe the number of bugs of each priority fixed during the training periods. Bug-fixing dates are determined as in the above cases. If a bug-fixing date falls into the training period, we count one bug as fixed in the assigned category.

p1-reported, p2-reported, p3-reported, p4-reported and p5-reported   These five features provide information about the number of bugs reported with the corresponding priority during the training periods. The reporting dates are determined as above.

lineAddedI, lineDeletedI and totalLineOperationsI   These three features provide information about the lines of code added, deleted, and the total lines of code operated on (or changed) to fix bugs during the training periods. If a revision has a referenced entry or link in the Bugzilla database and the corresponding bug report is not marked as an enhancement but has a severity level, then we consider that revision to be a bug-fixing activity. Furthermore, the information in the CVS log allows us to extract how many lines of code were added and deleted for that revision, supplying the basis for lineAddedI and lineDeletedI. Adding these two features results in totalLineOperationsI. These three variables capture how much work (in terms of the number of lines of code) is accomplished by the authors to fix bugs.

lineOperationIRBugFixes   This feature measures the average number of lines of code changed to fix bugs during the training periods. Thus, this feature captures the size of the bugs fixed and reveals any correlation between the average size of fixed bugs and the defects. We derive the feature lineOperationIRBugFixes by dividing the total number of lines changed to fix bugs by the total number of bugs fixed.

lineOperationIRTotalLines   This feature describes the work effort of the authors to fix bugs relative to their other work during the training periods. We already computed the total number of lines changed (or operated on) to fix bugs and for other activities such as enhancements. Hence, we can derive this feature by dividing the total number of lines changed to fix bugs by the total number of lines changed for any other activity.
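A minimal sketch of the two ratio features, with guards against division by zero; the argument names are ours and merely stand in for the aggregates described above.

```python
def bug_fix_ratios(lines_changed_for_bug_fixes, bugs_fixed, lines_changed_other):
    """lineOperationIRBugFixes and lineOperationIRTotalLines with zero guards."""
    avg_fix_size = lines_changed_for_bug_fixes / bugs_fixed if bugs_fixed else 0.0
    relative_effort = (
        lines_changed_for_bug_fixes / lines_changed_other if lines_changed_other else 0.0
    )
    return avg_fix_size, relative_effort

# e.g. 450 lines changed across 9 bug fixes, 1500 lines changed for other work
print(bug_fix_ratios(450, 9, 1500))  # (50.0, 0.3)
```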

lifeTimeBlocker, lifeTimeCritical, lifeTimeMajor, lifeTimeMinor, lifeTimeNormal and lifeTimeTrivial   These six features describe the lifetime of the different types of bugs fixed during the training periods. Both the Bugzilla and CVS databases provide information about the opening and closing dates of bugs, and Bugzilla provides the severity level of a bug. Consequently, we can compute the lifetime of any type of bug by taking the difference between the closing and the opening dates. Note that even when the opening dates lie outside the considered training periods we use them to compute the bug lifetimes.
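One possible way to compute such lifetimes, assuming hypothetical bug records that carry a severity, an opening date, and a closing date; averaging the lifetimes per severity is one plausible aggregation, not necessarily the paper's exact definition.

```python
from datetime import date

def lifetimes_by_severity(bugs, training_start, training_end):
    """Average lifetime (in days) of fixed bugs, grouped by severity.

    Only bugs closed within the training period count; the opening date may lie
    before the training period, as noted above.
    """
    per_severity = {}
    for bug in bugs:
        if training_start <= bug["closed"] < training_end:
            days = (bug["closed"] - bug["opened"]).days
            per_severity.setdefault(bug["severity"], []).append(days)
    return {sev: sum(days) / len(days) for sev, days in per_severity.items()}

# hypothetical bug records
bugs = [
    {"severity": "critical", "opened": date(2005, 1, 2), "closed": date(2005, 1, 20)},
    {"severity": "minor", "opened": date(2004, 12, 1), "closed": date(2005, 2, 1)},
]
print(lifetimes_by_severity(bugs, date(2005, 1, 1), date(2005, 3, 1)))
```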

hasBug   This is the target variable of some of our models. It describes whether any kind of bug (blocker, critical, major, minor, normal, or trivial) has been reported in the target period.

Appendix C: Dataset Format

This section describes the format of the datasets used in the first experiment.

A dataset consists of two parts, a labeling part and a feature-computation part. The labeling period is usually one month long; in this period we record the number of bugs reported—the target variable—for each observed file. The feature-computation period—the training period—can extend from one month up to the maximum length of the observed period; it starts one month before the labeling period and extends into the past. In this period we compute the features listed in Table 2 for each file for which we recorded the number of bugs reported during the labeling period. The description of each feature can be found in the section above. The following is the mathematical notation of the dataset:

Assume that the observed period is d months. Let \(Y_T = \{y_{T,1}, y_{T,2}, \ldots, y_{T,j}, \ldots, y_{T,s}\}\) be a vector of dimension s (s is the number of observed files), where \(y_{T,j}\) is the number of bugs reported for file j at time T, with \(1 < T \le d\). If \(X_t = \{f_{t,1}, f_{t,2}, \ldots, f_{t,i}, \ldots, f_{t,n}\}\) is a feature vector of dimension n, where \(f_{t,i}\) is file feature i computed from the history information at time t, \(n \in \mathbb{N}\), \(1 < t \le d-1\), \(t < T\), and \(s \gg n\), then the constructed dataset is given by \(\left(\sum_{t=x}^{T-1} X_t, Y_T\right)\). By changing the variables x and T we can generate different datasets.
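The following sketch mirrors this notation: features are computed per month and summed over the training months x..T−1, while the label for each file is the bug count of month T. The two callables are placeholders for the CVS/Bugzilla extraction step and are not part of the paper.

```python
def build_dataset(file_ids, feature_for_month, bugs_in_month, x, T):
    """Assemble one dataset following the notation above.

    feature_for_month(file_id, t) -> list of n feature values for month t
    bugs_in_month(file_id, T)     -> number of bugs reported for the file in month T
    Feature vectors are summed element-wise over the training months x..T-1;
    the label is taken from month T.
    """
    X, Y = [], []
    for f in file_ids:
        monthly = [feature_for_month(f, t) for t in range(x, T)]
        X.append([sum(values) for values in zip(*monthly)])  # element-wise sum
        Y.append(bugs_in_month(f, T))
    return X, Y
```

Sliding x and T across the observed period then yields the different datasets mentioned above.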

About this article

Cite this article

Ekanayake, J., Tappolet, J., Gall, H.C. et al. Time variance and defect prediction in software projects. Empir Software Eng 17, 348–389 (2012). https://doi.org/10.1007/s10664-011-9180-x
