ABSTRACT
Background: The prediction performance of a case-based reasoning (CBR) model is influenced by the combination of the following parameters: (i) similarity function, (ii) number of nearest neighbor cases, (iii) weighting technique used for attributes, and (iv) solution algorithm. Each combination of the above parameters is considered an instantiation of the general CBR-based prediction method. The selection of an instantiation for a new data set with specific characteristics (such as size, defect density, and language) is called customization of the general CBR method.
Aims: For the purpose of defect prediction, we address the question of which combinations of parameters work best in which situation. Three more specific questions were studied:
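To make the four parameters concrete, the following minimal sketch shows how one instantiation could be assembled. It is an illustration only, not the implementation used in the study: the function names, the weighted Euclidean similarity, and the toy case base are all assumptions introduced here.

```python
import math

def weighted_euclidean_similarity(a, b, weights):
    """Parameter (i): similarity function (hypothetical choice).
    Maps a weighted Euclidean distance into (0, 1]."""
    dist = math.sqrt(sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights)))
    return 1.0 / (1.0 + dist)

def unweighted_average(neighbors):
    """Parameter (iv): solution algorithm. Averages the neighbors' solutions,
    ignoring their similarity scores."""
    return sum(solution for _, solution in neighbors) / len(neighbors)

def cbr_predict(case_base, query, weights, k,
                similarity=weighted_euclidean_similarity,
                solve=unweighted_average):
    """One CBR instantiation: similarity function, k (parameter ii),
    attribute weights (parameter iii), and solution algorithm.
    case_base is a list of (attributes, known_defect_count) pairs."""
    scored = [(similarity(attrs, query, weights), solution)
              for attrs, solution in case_base]
    # Retrieve the k most similar cases.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return solve(scored[:k])

# Toy case base (invented values): two attributes per module, defect count as solution.
case_base = [((1.0, 2.0), 3.0), ((1.0, 2.5), 5.0), ((10.0, 10.0), 40.0)]
prediction = cbr_predict(case_base, query=(1.0, 2.0), weights=(1.0, 1.0), k=2)
```

Swapping any one argument (a different `similarity`, `k`, `weights`, or `solve`) yields a different instantiation in the sense described above.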
(RQ1) Does one size fit all? Is one instantiation always the best?
(RQ2) If not, which individual and combined parameter settings occur most frequently in generating the best prediction results?
(RQ3) Are there context-specific rules to support the customization?
Method: In total, 120 different CBR instantiations were created and applied to 11 data sets from the PROMISE repository. Predictions were evaluated in terms of their mean magnitude of relative error (MMRE) and percentage Pred(α) of objects fulfilling a prediction quality level α. For the third research question, dependency network analysis was performed.
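The two evaluation measures can be sketched as follows, using their commonly cited definitions: the magnitude of relative error of one prediction is |actual - predicted| / |actual|, MMRE is its mean, and Pred(α) is the share of predictions whose relative error is at most α. This is a minimal sketch under those assumptions; the function names are hypothetical, α is taken as a fraction rather than a percentage, and cases with an actual value of zero (for which the relative error is undefined) are assumed to be absent.

```python
def mre(actual, predicted):
    """Magnitude of relative error for a single prediction.
    Assumes actual != 0; the measure is undefined otherwise."""
    return abs(actual - predicted) / abs(actual)

def mmre(actuals, predictions):
    """Mean magnitude of relative error over all predictions."""
    errors = [mre(a, p) for a, p in zip(actuals, predictions)]
    return sum(errors) / len(errors)

def pred(actuals, predictions, alpha):
    """Fraction of predictions with relative error <= alpha
    (multiply by 100 to report a percentage)."""
    errors = [mre(a, p) for a, p in zip(actuals, predictions)]
    return sum(1 for e in errors if e <= alpha) / len(errors)

# Invented example values, for illustration only.
actuals = [10.0, 20.0, 40.0]
predictions = [12.0, 15.0, 40.0]
example_mmre = mmre(actuals, predictions)
example_pred = pred(actuals, predictions, alpha=0.25)
```

Lower MMRE and higher Pred(α) both indicate better predictions, so the two measures can rank instantiations differently.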
Results: The most frequent parameter options for CBR instantiations were neural-network-based sensitivity analysis (as the weighting technique), unweighted average (as the solution algorithm), and the maximum number of nearest neighbors (as the number of nearest neighbors). Using dependency network analysis, a set of recommendations for customization was provided.
Conclusion: An approach to support customization is provided. It was confirmed that application of context-specific rules across groups of similar data sets is risky and produces poor results.
Index Terms
- Customization support for CBR-based defect prediction