A set of measures designed to identify overlapped instances in software defect prediction

Abstract

The performance of a learning model depends strongly on the characteristics of its training data. Previous results suggest that overlap between classes and the presence of noise have the strongest impact on a learning algorithm's performance, and software defect datasets are no exception. Class overlap is a critical problem in which data samples appear to be valid examples of more than one class, and it may also be responsible for the presence of noise in datasets. We investigate how overlapped instances in a dataset influence a classifier's performance and how to deal with the class overlap problem. To obtain a close estimate of class overlap, we propose four measures: nearest enemy ratio, subconcept ratio, likelihood ratio and soft margin ratio. We performed our investigation on 327 binary defect classification datasets obtained from 54 software projects, first identifying overlapped datasets using three data complexity measures proposed in the literature. We also include treatment effort in the prediction process. Subsequently, we used our proposed measures to find overlapped instances in the identified overlapped datasets. Our results indicate that training a classifier on training data free from overlapped instances improves its performance on test data containing overlapped instances. The classifiers also perform significantly better when the evaluation measure takes effort into account.
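The formal definitions of the four measures are given in the full text, not in the abstract. As a purely illustrative sketch of the intuition behind the first one, the Python snippet below flags an instance as potentially overlapped when its nearest neighbour from the opposite class (its "nearest enemy") is at least as close as its nearest neighbour from its own class. The function name nearest_enemy_ratio, the Euclidean distance metric, and the threshold of 1.0 are assumptions made here for illustration, not the authors' definition.

```python
# Illustrative sketch (not the paper's exact measure): compare, for each
# instance, the distance to its nearest same-class neighbour with the
# distance to its nearest opposite-class neighbour ("nearest enemy").
# Ratios at or above 1 suggest the instance lies in an overlapping region.
import numpy as np

def nearest_enemy_ratio(X, y):
    """For each instance, distance to nearest same-class neighbour
    divided by distance to nearest opposite-class neighbour."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # Pairwise Euclidean distances; self-distances are masked out.
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)
    ratios = np.empty(len(X))
    for i in range(len(X)):
        same = dist[i, y == y[i]].min()    # nearest friend (self excluded)
        enemy = dist[i, y != y[i]].min()   # nearest enemy
        ratios[i] = same / enemy
    return ratios

# Hypothetical usage: instances whose nearest enemy is at least as close as
# their nearest friend (ratio >= 1) are treated as overlapped and removed
# from the training split before fitting a classifier.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(1.0, 1.0, (50, 2))])
    y = np.array([0] * 50 + [1] * 50)
    overlapped = nearest_enemy_ratio(X, y) >= 1.0
    print(f"{overlapped.sum()} of {len(y)} instances flagged as overlapped")
```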


Notes

  1. http://openscience.us/repo/defect/ck/.

Author information

Corresponding author

Correspondence to Shivani Gupta.

About this article

Cite this article

Gupta, S., Gupta, A. A set of measures designed to identify overlapped instances in software defect prediction. Computing 99, 889–914 (2017). https://doi.org/10.1007/s00607-016-0538-1


