Abstract
Identifying a reliable fault prediction technique is the key requirement for building an effective fault prediction model. The performance of fault prediction techniques has been found to depend heavily on the characteristics of the fault dataset. To mitigate this issue, researchers have evaluated and compared a plethora of fault prediction techniques by varying the context in terms of domain information, characteristics of input data, complexity, etc. However, the lack of an accepted benchmark makes it difficult to select a fault prediction technique for a particular prediction context. In this paper, we present a recommendation system that facilitates the selection of appropriate technique(s) for building a fault prediction model. First, we reviewed the literature to elicit the various characteristics of fault datasets and the appropriateness of machine learning and statistical techniques for the identified characteristics. Subsequently, we formalized our findings and built a recommendation system that helps in the selection of fault prediction techniques. We performed an initial appraisal of the presented system and found that the proposed recommendation system provides useful hints for selecting fault prediction techniques.
Acknowledgments
The authors would like to thank the editor of the journal and the anonymous reviewers for their valuable comments, guidance, and suggestions, which have improved the quality of the paper and led to its current form. Further, we would like to thank the Ministry of Human Resource Development (MHRD), India, for providing an institute assistantship.
Cite this article
Rathore, S.S., Kumar, S. A decision tree logic based recommendation system to select software fault prediction techniques. Computing 99, 255–285 (2017). https://doi.org/10.1007/s00607-016-0489-6
DOI: https://doi.org/10.1007/s00607-016-0489-6