
Outlier Mining Techniques for Software Defect Prediction

  • Conference paper
  • In: Software Quality: Higher Software Quality through Zero Waste Development (SWQD 2023)

Abstract

Using software metrics to quantify software, various approaches have been proposed for locating defect-prone source code units within software projects. Most of these approaches rely on supervised learning algorithms, which require labeled data to adjust their parameters during the learning phase. In practice, such labeled training data is often not available. Unsupervised algorithms do not require training data and can therefore help to overcome this limitation.

In this work, we evaluate the effect of unsupervised learning by means of cluster-based algorithms and outlier mining algorithms for the task of defect prediction, i.e., locating defect-prone source code units. We investigate the effect of various class balancing and feature compression techniques as preprocessing steps and show how sliding windows can be used to capture time series of source code metrics. We evaluate the Isolation Forest and the Local Outlier Factor as representatives of outlier mining techniques. Our experiments on three publicly available datasets, comprising a total of 11 software projects, indicate that considering time series can improve static examinations by up to 3%. The results further show that supervised algorithms can outperform unsupervised approaches on all projects. Among the unsupervised approaches, the Isolation Forest achieves the best accuracy on 10 out of 11 projects.
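For illustration only, and not the authors' exact pipeline: the sketch below shows how the two outlier miners named above could be applied to sliding windows of source code metrics using scikit-learn (the library referenced in note 1 below). The data shapes, window width, and all variable names are hypothetical assumptions.

    # Illustrative sketch only -- not the authors' pipeline. Assumes an
    # array of static source code metrics of shape (files, versions,
    # metrics); all shapes and names here are hypothetical.
    import numpy as np
    from sklearn.ensemble import IsolationForest
    from sklearn.neighbors import LocalOutlierFactor

    rng = np.random.default_rng(0)
    metrics = rng.normal(size=(200, 6, 5))  # 200 files, 6 versions, 5 metrics

    # Sliding window of width 3 over the version axis: each sample is the
    # concatenation of a file's metric vectors from 3 consecutive versions,
    # turning static per-version snapshots into short time series.
    W = 3
    windows = np.concatenate(
        [metrics[:, i:i + W, :].reshape(len(metrics), -1)
         for i in range(metrics.shape[1] - W + 1)]
    )

    # Both miners label each windowed sample; samples flagged as outliers
    # (label -1) are treated as defect-prone.
    iso_labels = IsolationForest(n_estimators=100, random_state=0).fit_predict(windows)
    lof_labels = LocalOutlierFactor(n_neighbors=20).fit_predict(windows)

The two miners score anomalies differently: the Isolation Forest isolates samples with random axis-parallel splits, so points that are separated after few splits score as outliers, while the Local Outlier Factor compares each sample's local density to that of its neighbors. They can therefore disagree on which windows look anomalous.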


Notes

  1. https://scikit-learn.org.

  2. https://www.tensorflow.org/ and https://keras.io/.

  3. For completeness, we also evaluated using no balancing or no feature compression technique. Those results are, as expected, weaker (cf. auxiliary material); one such preprocessing pairing is sketched after these notes.
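As a rough sketch of the kind of preprocessing note 3 refers to, the snippet below pairs one balancing technique (SMOTE, from the imbalanced-learn package) with one feature compression technique (PCA). The paper evaluates several such techniques; this particular pairing and the synthetic data are assumptions for illustration.

    # Illustrative sketch of one balancing / feature compression pairing;
    # the paper compares several such techniques. Data here is synthetic.
    import numpy as np
    from imblearn.over_sampling import SMOTE   # class balancing
    from sklearn.decomposition import PCA      # feature compression

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 20))             # hypothetical metric vectors
    y = (rng.random(500) < 0.1).astype(int)    # ~10% defective: imbalanced

    # Oversample the minority (defective) class to a 1:1 ratio ...
    X_bal, y_bal = SMOTE(random_state=0).fit_resample(X, y)

    # ... then compress the metric space into a few principal components
    # before training or mining.
    X_red = PCA(n_components=5).fit_transform(X_bal)

SMOTE synthesizes new minority-class samples by interpolating between nearest neighbors rather than duplicating existing ones, and PCA helps because source code metrics are often strongly correlated.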


Acknowledgement

We thank the anonymous reviewers for their valuable feedback. This work was partially funded by the German Federal Ministry of Education and Research (BMBF) through grants 01IS20088B (“KnowhowAnalyzer”) and 01IS22062 (“AI research group FFS-AI”).

Author information


Corresponding author

Correspondence to Tim Cech.



Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Cech, T., Atzberger, D., Scheibel, W., Misra, S., Döllner, J. (2023). Outlier Mining Techniques for Software Defect Prediction. In: Mendez, D., Winkler, D., Kross, J., Biffl, S., Bergsmann, J. (eds) Software Quality: Higher Software Quality through Zero Waste Development. SWQD 2023. Lecture Notes in Business Information Processing, vol 472. Springer, Cham. https://doi.org/10.1007/978-3-031-31488-9_3


  • DOI: https://doi.org/10.1007/978-3-031-31488-9_3

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-31487-2

  • Online ISBN: 978-3-031-31488-9

  • eBook Packages: Computer Science, Computer Science (R0)
