Abstract
The performance of learning models depends heavily on the characteristics of the training data. Previous results suggest that overlap between classes and the presence of noise have the strongest impact on the performance of learning algorithms, and software defect datasets are no exception. Class overlap, in which data samples appear as valid examples of more than one class, is a critical problem for machine learning classifiers and may be responsible for the presence of noise in datasets. We aim to investigate how the presence of overlapped instances in a dataset influences a classifier’s performance, and how to deal with the class overlap problem. To estimate class overlap closely, we propose four measures: nearest enemy ratio, subconcept ratio, likelihood ratio and soft margin ratio. We performed our investigation using 327 binary defect classification datasets obtained from 54 software projects, where we first identified overlapped datasets using three data complexity measures proposed in the literature. We then used our proposed measures to find overlapped instances in the identified overlapped datasets. We also included treatment effort in the prediction process. Our results indicate that training a classifier on training data free from overlapped instances leads to improved classifier performance on test data containing overlapped instances. The classifiers perform significantly better when the evaluation measure takes effort into account.
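To make the idea of filtering overlapped instances concrete, the following is a minimal sketch, not the authors' method. The abstract names a nearest enemy ratio but does not define it, so the sketch assumes one plausible reading: the distance from an instance to its nearest same-class neighbour divided by the distance to its nearest other-class neighbour, with a ratio above 1 treated as a sign of overlap. The function names, the threshold and the toy data are all hypothetical.

```python
import math


def nearest_enemy_ratios(X, y):
    """Return, per instance, d(nearest same-class) / d(nearest other-class).

    ASSUMED definition for illustration only; the abstract does not
    give the paper's actual formula.
    """
    ratios = []
    for i, (xi, yi) in enumerate(zip(X, y)):
        # Distance to the closest other instance of the same class.
        friend = min(
            (math.dist(xi, xj) for j, (xj, yj) in enumerate(zip(X, y))
             if j != i and yj == yi),
            default=math.inf,
        )
        # Distance to the closest instance of a different class.
        enemy = min(
            (math.dist(xi, xj) for xj, yj in zip(X, y) if yj != yi),
            default=math.inf,
        )
        if math.isinf(enemy):
            ratios.append(0.0)       # no other class present: no overlap
        elif enemy == 0.0:
            ratios.append(math.inf)  # coincides with an enemy: full overlap
        else:
            ratios.append(friend / enemy)
    return ratios


def drop_overlapped(X, y, threshold=1.0):
    """Keep only instances whose ratio does not exceed the (assumed) threshold."""
    ratios = nearest_enemy_ratios(X, y)
    keep = [i for i, r in enumerate(ratios) if r <= threshold]
    return [X[i] for i in keep], [y[i] for i in keep]


# Hypothetical toy data: two tight clusters plus one class-1 point
# that lies much closer to the class-0 cluster than to its own class.
X = [(0.0, 0.0), (0.0, 0.2), (0.2, 0.0),
     (5.0, 5.0), (5.0, 6.0), (6.0, 5.0),
     (1.0, 1.0)]
y = [0, 0, 0, 1, 1, 1, 1]

X_clean, y_clean = drop_overlapped(X, y)
```

In this toy setting only the stray class-1 point at (1.0, 1.0) is removed. A classifier would then be trained on `X_clean` and `y_clean` while the test data is left unfiltered, mirroring the setup described in the abstract.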
Cite this article
Gupta, S., Gupta, A. A set of measures designed to identify overlapped instances in software defect prediction. Computing 99, 889–914 (2017). https://doi.org/10.1007/s00607-016-0538-1
DOI: https://doi.org/10.1007/s00607-016-0538-1