
The impact of feature reduction techniques on defect prediction models

Published in: Empirical Software Engineering

Abstract

Defect prediction is an important task for preserving software quality. Most prior work on defect prediction uses software features, such as the number of lines of code, to predict whether a file or commit will be defective in the future. There are several reasons to keep the number of features in a defect prediction model small; for example, a small number of features avoids multicollinearity and the so-called ‘curse of dimensionality’. Feature selection and feature reduction techniques can help to reduce the number of features in a model. Feature selection techniques reduce the number of features by selecting the most important ones, while feature reduction techniques do so by creating new, combined features from the original features. Several recent studies have investigated the impact of feature selection techniques on defect prediction. However, no large-scale study has investigated the impact of multiple feature reduction techniques on defect prediction. In this paper, we study the impact of eight feature reduction techniques on the performance, and the variance in performance, of five supervised and five unsupervised defect prediction models. In addition, we compare the impact of the studied feature reduction techniques with that of the two best-performing feature selection techniques (according to prior work). The highlights of our study are as follows: (1) The studied correlation and consistency-based feature selection techniques result in the best-performing supervised defect prediction models, while neural network-based feature reduction techniques (restricted Boltzmann machine and autoencoder) result in the best-performing unsupervised defect prediction models. In both cases, the defect prediction models that use the selected/generated features outperform those that use the original features (in terms of AUC and performance variance). (2) Neural network-based feature reduction techniques generate features that have a small performance variance across both supervised and unsupervised defect prediction models. Hence, we recommend that practitioners who do not wish to choose a best-performing defect prediction model for their data use a neural network-based feature reduction technique.
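The distinction between the two families of techniques can be sketched in a few lines. The snippet below uses scikit-learn's `SelectKBest` and `PCA` purely as illustrative stand-ins on synthetic data; the paper itself evaluates correlation/consistency-based selection and neural network-based reduction (restricted Boltzmann machine, autoencoder), not these two.

```python
# Feature selection keeps a subset of the original columns; feature reduction
# builds new features as combinations of all original columns.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.decomposition import PCA

# Synthetic stand-in for a defect dataset: 20 software features, binary label.
X, y = make_classification(n_samples=200, n_features=20, n_informative=5,
                           random_state=0)

# Feature selection: keep the 5 most important original features.
selected = SelectKBest(f_classif, k=5).fit_transform(X, y)

# Feature reduction: derive 5 new features that combine all 20 originals.
reduced = PCA(n_components=5, random_state=0).fit_transform(X)

print(selected.shape, reduced.shape)  # both (200, 5)
```

Either way, the downstream defect prediction model sees only five features instead of twenty, which is what mitigates multicollinearity and the curse of dimensionality.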

Notes

  1. https://github.com/klainfo/ScottKnottESD

  2. Note that the ranks are slightly different from Fig. 4 because Scott-Knott ESD is a clustering algorithm, and hence is affected by the total set of input distributions. For more information, see https://github.com/klainfo/ScottKnottESD.

  3. https://sailhome.cs.queensu.ca/replication/featred-vs-featsel-defectpred/
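The point in note 2, that a clustering-based ranking depends on the whole set of input distributions, can be illustrated with a toy example. The function below is not Scott-Knott ESD itself, just a hypothetical gap-based grouping of mean scores, but it shows the same effect: adding one more technique can change how the existing techniques group into ranks.

```python
# Toy illustration (not Scott-Knott ESD): rank techniques by grouping mean
# scores, merging neighbours whose difference is below a gap threshold.
def rank_by_gap(means, gap=0.06):
    """Assign ranks by grouping sorted means whose neighbours differ < gap."""
    order = sorted(means, key=means.get)
    ranks, rank = {}, 1
    for prev, cur in zip([None] + order, order):
        if prev is not None and means[cur] - means[prev] >= gap:
            rank += 1  # gap is large enough: start a new rank group
        ranks[cur] = rank
    return ranks

a = {"t1": 0.70, "t2": 0.80}
b = dict(a, t3=0.75)  # t3's mean sits between t1 and t2

print(rank_by_gap(a))  # t1 and t2 fall into different ranks
print(rank_by_gap(b))  # t3 bridges the gap: all three share rank 1
```

Because each pairwise merge decision depends on its neighbours in the sorted order, the rank assigned to `t2` changes even though its own score distribution did not, which is exactly why the ranks in the paper's figures shift with the set of inputs.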

Acknowledgment

This work was partially supported by NSERC and by JSPS KAKENHI (Grant Numbers JP16K12415 and JP18H03222).

Corresponding author

Correspondence to Masanari Kondo.

Additional information

Communicated by: Federica Sarro

About this article

Cite this article

Kondo, M., Bezemer, CP., Kamei, Y. et al. The impact of feature reduction techniques on defect prediction models. Empir Software Eng 24, 1925–1963 (2019). https://doi.org/10.1007/s10664-018-9679-5
