Abstract
Background: Defect prediction on unlabelled datasets is a challenging and widespread problem in software engineering. Machine learning is of great value in this context because it provides unsupervised techniques that are applicable to unlabelled datasets.
Objective: This study aims at comparing various approaches employed over the years on unlabelled datasets to predict defective modules, i.e. the ones that need more attention in the testing phase. Our comparison is based on the measurement of performance metrics and on the real defect information derived from software archives. Our work leverages a new dataset obtained by extracting and preprocessing metrics from a C++ software project.
Method: Our empirical study has taken advantage of CLAMI and its improvement CLAMI+, which we have applied to high energy physics software datasets. Furthermore, we have used clustering techniques, such as the K-means algorithm, to find potentially critical modules.
Results: Our experimental analysis has been carried out on 1 open source project with 34 software releases. We have applied 17 ML techniques to the labelled datasets obtained by following the CLAMI and CLAMI+ approaches. The two approaches have been evaluated by using different performance metrics; our results show that CLAMI+ performs better than CLAMI. The average predictive accuracy is around 95% for 4 out of the 17 ML techniques, which also show a Kappa statistic greater than 0.80. We applied K-means to the same dataset and obtained 2 clusters labelled according to the output of CLAMI and CLAMI+.
Conclusion: Based on the results of the different statistical tests, we conclude that no significant performance differences have been found among the selected classification techniques.
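As a rough illustration of the clustering step summarised above, the sketch below applies K-means with two clusters to a module-level metrics table and labels the cluster with the higher average metric values as potentially defect-prone, mirroring the CLAMI-style intuition that higher metric values indicate riskier modules. This is a minimal sketch, not the authors' implementation: it assumes scikit-learn and pandas, and the file name geant4_metrics.csv and its column layout are hypothetical placeholders.

```python
# Minimal sketch (not the authors' implementation): cluster software modules
# into two groups with K-means and label the cluster whose standardized metric
# values are higher, on average, as potentially defect-prone.
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Load a module-level metrics table, e.g. one row per C++ module/class.
metrics = pd.read_csv("geant4_metrics.csv")          # hypothetical file name
X = StandardScaler().fit_transform(metrics.values)   # put metrics on a common scale

# Two clusters: candidate "clean" vs. candidate "defect-prone" modules.
kmeans = KMeans(n_clusters=2, random_state=0, n_init=10).fit(X)

# CLAMI-like heuristic: the cluster with the higher mean standardized metric
# values is treated as the defect-prone one.
cluster_means = [X[kmeans.labels_ == c].mean() for c in (0, 1)]
defect_prone_cluster = int(cluster_means[1] > cluster_means[0])
metrics["predicted_label"] = (kmeans.labels_ == defect_prone_cluster).astype(int)

print(metrics["predicted_label"].value_counts())
```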
Supported by organization INFN.
Acknowledgment
The authors thank the Imagix Corporation for providing an extended free license of Imagix 4D to perform this work.
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Ronchieri, E., Canaparo, M., Belgiovine, M. (2020). Software Defect Prediction on Unlabelled Datasets: A Comparative Study. In: Gervasi, O., et al. (eds.) Computational Science and Its Applications – ICCSA 2020. Lecture Notes in Computer Science, vol 12250. Springer, Cham. https://doi.org/10.1007/978-3-030-58802-1_25
DOI: https://doi.org/10.1007/978-3-030-58802-1_25
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-58801-4
Online ISBN: 978-3-030-58802-1