Abstract
Defect prediction is an important task for preserving software quality. Most prior work on defect prediction uses software features, such as the number of lines of code, to predict whether a file or commit will be defective in the future. There are several reasons to keep the number of features in a defect prediction model small. For example, a small number of features avoids multicollinearity and the so-called ‘curse of dimensionality’. Feature selection and reduction techniques can help to keep the number of features in a model small. Feature selection techniques reduce the number of features by selecting the most important ones, while feature reduction techniques do so by creating new, combined features from the original features. Several recent studies have investigated the impact of feature selection techniques on defect prediction. However, no large-scale study has investigated the impact of multiple feature reduction techniques on defect prediction. In this paper, we study the impact of eight feature reduction techniques on the performance, and the variance in performance, of five supervised and five unsupervised defect prediction models. In addition, we compare the impact of the studied feature reduction techniques with that of the two best-performing feature selection techniques (according to prior work). The highlights of our study are as follows: (1) The studied correlation-based and consistency-based feature selection techniques result in the best-performing supervised defect prediction models, while the neural network-based feature reduction techniques (restricted Boltzmann machine and autoencoder) result in the best-performing unsupervised defect prediction models. In both cases, the defect prediction models that use the selected/generated features outperform those that use the original features (in terms of AUC and performance variance). (2) Neural network-based feature reduction techniques generate features that have a small variance across both supervised and unsupervised defect prediction models. Hence, we recommend that practitioners who do not wish to choose a best-performing defect prediction model for their data use a neural network-based feature reduction technique.
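The selection/reduction distinction above can be sketched in a few lines of scikit-learn. This is an illustrative sketch only: `SelectKBest` with an ANOVA F-test and PCA stand in for the correlation/consistency-based selection and neural network-based reduction techniques that the paper actually studies, and the synthetic dataset is a hypothetical stand-in for real defect data.

```python
# Feature selection vs. feature reduction on a synthetic "defect" dataset.
# Illustrative stand-ins: SelectKBest (selection) and PCA (reduction);
# the paper studies CFS/consistency-based selection and RBM/autoencoder
# reduction, which are not reproduced here.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

# 200 modules described by 20 software features, with a binary defect label.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# Feature selection: keep the 5 ORIGINAL features most associated with y.
X_sel = SelectKBest(f_classif, k=5).fit_transform(X, y)

# Feature reduction: create 5 NEW features as combinations of all 20.
X_red = PCA(n_components=5, random_state=0).fit_transform(X)

print(X_sel.shape, X_red.shape)  # both have 5 columns, but X_red's
                                 # columns no longer map to single metrics
```

The key difference is interpretability: each column of `X_sel` is still one of the original software metrics, whereas each column of `X_red` is a linear (or, for the neural network-based techniques in the paper, non-linear) combination of all of them.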

Notes
Note that the ranks are slightly different from Fig. 4 because Scott-Knott ESD is a clustering algorithm and is hence affected by the total set of input distributions. For more information, see https://github.com/klainfo/ScottKnottESD.
Acknowledgment
This work was partially supported by NSERC as well as JSPS KAKENHI (Grant Numbers: JP16K12415 and JP18H03222).
Communicated by: Federica Sarro
Cite this article
Kondo, M., Bezemer, CP., Kamei, Y. et al. The impact of feature reduction techniques on defect prediction models. Empir Software Eng 24, 1925–1963 (2019). https://doi.org/10.1007/s10664-018-9679-5