
Outlier Mining Techniques for Software Defect Prediction

  • Conference paper
  • In: Software Quality: Higher Software Quality through Zero Waste Development (SWQD 2023)

Abstract

Using software metrics to quantify software, various approaches have been proposed for locating defect-prone source code units within software projects. Most of these approaches rely on supervised learning algorithms, which require labeled data to adjust their parameters during the learning phase. In practice, such labeled training data is often not available. Unsupervised algorithms do not require training data and can therefore help to overcome this limitation.

In this work, we evaluate the effect of unsupervised learning by means of cluster-based algorithms and outlier mining algorithms for the task of defect prediction, i.e., locating defect-prone source code units. We investigate the effect of various class balancing and feature compression techniques as preprocessing steps and show how sliding windows can be used to capture time series of source code metrics. We evaluate the Isolation Forest and the Local Outlier Factor as representatives of outlier mining techniques. Our experiments on three publicly available datasets, comprising a total of 11 software projects, indicate that considering time series can improve static examinations by up to 3%. The results further show that supervised algorithms can outperform unsupervised approaches on all projects. Among the unsupervised approaches, the Isolation Forest achieves the best accuracy on 10 out of 11 projects.
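For illustration only, and not the authors' exact pipeline: the sketch below shows how the two outlier miners named above could be applied to sliding windows of source code metrics using scikit-learn (the library referenced in note 1 below). The data shapes, window width, and all variable names are hypothetical assumptions.

    # Illustrative sketch only -- not the authors' pipeline. Assumes an
    # array of static source code metrics of shape (files, versions,
    # metrics); all shapes and names here are hypothetical.
    import numpy as np
    from sklearn.ensemble import IsolationForest
    from sklearn.neighbors import LocalOutlierFactor

    rng = np.random.default_rng(0)
    metrics = rng.normal(size=(200, 6, 5))  # 200 files, 6 versions, 5 metrics

    # Sliding window of width 3 over the version axis: each sample is the
    # concatenation of a file's metric vectors from 3 consecutive versions,
    # turning static per-version snapshots into short time series.
    W = 3
    windows = np.concatenate(
        [metrics[:, i:i + W, :].reshape(len(metrics), -1)
         for i in range(metrics.shape[1] - W + 1)]
    )

    # Both miners label each windowed sample; samples flagged as outliers
    # (label -1) are treated as defect-prone.
    iso_labels = IsolationForest(n_estimators=100, random_state=0).fit_predict(windows)
    lof_labels = LocalOutlierFactor(n_neighbors=20).fit_predict(windows)

The two miners score anomalies differently: the Isolation Forest isolates samples with random axis-parallel splits, so points that are separated after few splits score as outliers, while the Local Outlier Factor compares each sample's local density to that of its neighbors. They can therefore disagree on which windows look anomalous.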


Notes

  1. https://scikit-learn.org.

  2. https://www.tensorflow.org/ and https://keras.io/.

  3. For completeness, we also evaluated using no balancing or no feature compression technique. Those results are, as expected, weaker (cf. auxiliary material); one such preprocessing pairing is sketched after these notes.
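As a rough sketch of the kind of preprocessing note 3 refers to, the snippet below pairs one balancing technique (SMOTE, from the imbalanced-learn package) with one feature compression technique (PCA). The paper evaluates several such techniques; this particular pairing and the synthetic data are assumptions for illustration.

    # Illustrative sketch of one balancing / feature compression pairing;
    # the paper compares several such techniques. Data here is synthetic.
    import numpy as np
    from imblearn.over_sampling import SMOTE   # class balancing
    from sklearn.decomposition import PCA      # feature compression

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 20))             # hypothetical metric vectors
    y = (rng.random(500) < 0.1).astype(int)    # ~10% defective: imbalanced

    # Oversample the minority (defective) class to a 1:1 ratio ...
    X_bal, y_bal = SMOTE(random_state=0).fit_resample(X, y)

    # ... then compress the metric space into a few principal components
    # before training or mining.
    X_red = PCA(n_components=5).fit_transform(X_bal)

SMOTE synthesizes new minority-class samples by interpolating between nearest neighbors rather than duplicating existing ones, and PCA helps because source code metrics are often strongly correlated.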


Acknowledgement

We thank the anonymous reviewers for their valuable feedback. This work was partially funded by the German Federal Ministry of Education and Research (BMBF) through grants 01IS20088B (“KnowhowAnalyzer”) and 01IS22062 (“AI research group FFS-AI”).

Author information


Corresponding author

Correspondence to Tim Cech.



Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Cech, T., Atzberger, D., Scheibel, W., Misra, S., Döllner, J. (2023). Outlier Mining Techniques for Software Defect Prediction. In: Mendez, D., Winkler, D., Kross, J., Biffl, S., Bergsmann, J. (eds) Software Quality: Higher Software Quality through Zero Waste Development. SWQD 2023. Lecture Notes in Business Information Processing, vol 472. Springer, Cham. https://doi.org/10.1007/978-3-031-31488-9_3


  • DOI: https://doi.org/10.1007/978-3-031-31488-9_3

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-31487-2

  • Online ISBN: 978-3-031-31488-9

  • eBook Packages: Computer Science, Computer Science (R0)
