A set of measures designed to identify overlapped instances in software defect prediction

Gupta, Shivani; Gupta, Atul

doi:10.1007/s00607-016-0538-1

Published: 10 January 2017

Computing, volume 99, pages 889–914 (2017)

Abstract

The performance of a learning model depends heavily on the characteristics of its training data. Previous studies suggest that overlap between classes and the presence of noise have the strongest impact on the performance of learning algorithms, and software defect datasets are no exception. In class overlap, data samples appear as valid examples of more than one class, which may be responsible for the presence of noise in datasets and is a critical problem for machine learning classifiers. We aim to investigate how the presence of overlapped instances in a dataset influences a classifier’s performance, and how to deal with the class overlap problem. To obtain a close estimate of class overlap, we propose four measures: nearest enemy ratio, subconcept ratio, likelihood ratio and soft margin ratio. We performed our investigation using 327 binary defect classification datasets obtained from 54 software projects, first identifying overlapped datasets using three data complexity measures proposed in the literature. We also incorporated treatment effort into the prediction process. Subsequently, we used our proposed measures to find overlapped instances in the identified overlapped datasets. Our results indicate that training a classifier on training data free from overlapped instances leads to improved classifier performance on test data containing overlapped instances. The classifiers perform significantly better when the evaluation measure takes effort into account.
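
The abstract names the four measures but does not define them, so the following Python sketch is only an illustration of the general idea behind a nearest-enemy-style ratio, not the authors' formulation. It assumes a hypothetical definition (distance to the nearest same-class instance divided by distance to the nearest other-class instance) and a hypothetical threshold of 1.0; the function names `nearest_enemy_ratio` and `split_overlapped` are likewise invented for this example.

```python
import math


def nearest_enemy_ratio(X, y):
    """Return one ratio per instance (assumed definition, not the paper's):
    distance to nearest same-class point / distance to nearest other-class point.
    Larger values mean the instance lies relatively closer to the other class."""
    ratios = []
    for i, (xi, yi) in enumerate(zip(X, y)):
        friend = math.inf  # nearest neighbour sharing the label
        enemy = math.inf   # nearest neighbour with a different label
        for j, (xj, yj) in enumerate(zip(X, y)):
            if j == i:
                continue
            d = math.dist(xi, xj)
            if yj == yi:
                friend = min(friend, d)
            else:
                enemy = min(enemy, d)
        # A zero enemy distance means an identical point carries another label.
        ratios.append(math.inf if enemy == 0 else friend / enemy)
    return ratios


def split_overlapped(X, y, threshold=1.0):
    """Partition instances into (clean, overlapped) using the assumed threshold."""
    clean, overlapped = [], []
    for x, label, r in zip(X, y, nearest_enemy_ratio(X, y)):
        (clean if r < threshold else overlapped).append((x, label))
    return clean, overlapped


# Toy data: two tight clusters plus one class-1 point sitting near class 0.
X = [(0.0, 0.0), (0.0, 0.2), (0.2, 0.0),
     (5.0, 5.0), (5.0, 5.2), (5.2, 5.0),
     (1.0, 1.0)]
y = [0, 0, 0, 1, 1, 1, 1]

clean, overlapped = split_overlapped(X, y)
print(len(clean), overlapped)
```

Under these assumptions, a classifier would then be trained on `clean` only, mirroring the experimental setup the abstract describes, while the test data would keep its overlapped instances.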

Notes

http://openscience.us/repo/defect/ck/.

Author information

Authors and Affiliations

Indian Institute of Information Technology, Design and Manufacturing Jabalpur, Jabalpur, India
Shivani Gupta & Atul Gupta

Corresponding author

Correspondence to Shivani Gupta.

About this article

Cite this article

Gupta, S., Gupta, A. A set of measures designed to identify overlapped instances in software defect prediction. Computing 99, 889–914 (2017). https://doi.org/10.1007/s00607-016-0538-1

Received: 27 January 2016
Accepted: 28 December 2016
Published: 10 January 2017
Issue Date: September 2017
DOI: https://doi.org/10.1007/s00607-016-0538-1
