Abstract
Defect prediction is an important task for preserving software quality. Most prior work on defect prediction uses software features, such as the number of lines of code, to predict whether a file or commit will be defective in the future. There are several reasons to keep the number of features in a defect prediction model small. For example, a small number of features avoids multicollinearity and the so-called ‘curse of dimensionality’. Feature selection and reduction techniques can help to keep the number of features in a model small. Feature selection techniques reduce the number of features by selecting the most important ones, while feature reduction techniques do so by creating new, combined features from the original features. Several recent studies have investigated the impact of feature selection techniques on defect prediction. However, no large-scale study has investigated the impact of multiple feature reduction techniques on defect prediction. In this paper, we study the impact of eight feature reduction techniques on the performance, and the variance in performance, of five supervised and five unsupervised defect prediction models. In addition, we compare the impact of the studied feature reduction techniques with that of the two best-performing feature selection techniques (according to prior work). The highlights of our study are as follows: (1) The studied correlation-based and consistency-based feature selection techniques result in the best-performing supervised defect prediction models, while the neural network-based feature reduction techniques (restricted Boltzmann machine and autoencoder) result in the best-performing unsupervised defect prediction models. In both cases, the defect prediction models that use the selected/generated features outperform those that use the original features (in terms of AUC and performance variance). (2) Neural network-based feature reduction techniques generate features that have a small variance across both supervised and unsupervised defect prediction models. Hence, we recommend that practitioners who do not wish to choose a best-performing defect prediction model for their data use a neural network-based feature reduction technique.
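The selection/reduction distinction above can be sketched in a few lines of scikit-learn. This is an illustrative sketch only: `SelectKBest` with an ANOVA F-test and PCA stand in for the correlation/consistency-based selection and neural network-based reduction techniques that the paper actually studies, and the synthetic dataset is a hypothetical stand-in for real defect data.

```python
# Feature selection vs. feature reduction on a synthetic "defect" dataset.
# Illustrative stand-ins: SelectKBest (selection) and PCA (reduction);
# the paper studies CFS/consistency-based selection and RBM/autoencoder
# reduction, which are not reproduced here.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

# 200 modules described by 20 software features, with a binary defect label.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# Feature selection: keep the 5 ORIGINAL features most associated with y.
X_sel = SelectKBest(f_classif, k=5).fit_transform(X, y)

# Feature reduction: create 5 NEW features as combinations of all 20.
X_red = PCA(n_components=5, random_state=0).fit_transform(X)

print(X_sel.shape, X_red.shape)  # both have 5 columns, but X_red's
                                 # columns no longer map to single metrics
```

The key difference is interpretability: each column of `X_sel` is still one of the original software metrics, whereas each column of `X_red` is a linear (or, for the neural network-based techniques in the paper, non-linear) combination of all of them.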

Notes
Note that the ranks are slightly different from Fig. 4 because Scott-Knott ESD is a clustering algorithm and is hence affected by the total set of input distributions. For more information, see https://github.com/klainfo/ScottKnottESD.
Acknowledgment
This work was partially supported by NSERC as well as JSPS KAKENHI (Grant Numbers: JP16K12415 and JP18H03222).
Communicated by: Federica Sarro
Cite this article
Kondo, M., Bezemer, CP., Kamei, Y. et al. The impact of feature reduction techniques on defect prediction models. Empir Software Eng 24, 1925–1963 (2019). https://doi.org/10.1007/s10664-018-9679-5