Software defect prediction model based on distance metric learning

  • Methodologies and Application
Soft Computing

Abstract

Software defect prediction (SDP) is an important means of analyzing software quality and reducing development costs. Data collected during the software lifecycle can be used to predict software defects. Many SDP models have been proposed, but their performance is not always satisfactory. In many existing prediction models based on machine learning, the distance metric between samples has a significant impact on the performance of the SDP model. In addition, the samples are usually class imbalanced. To address these issues, this paper proposes a novel distance metric learning (DML) method based on cost-sensitive learning (CSL) that reduces the impact of class imbalance. The learned metric is then applied to the large margin distribution machine (LDM) in place of the traditional kernel function. The improvement and optimization of LDM based on CSL are also studied, and the improved LDM, called CS-ILDM, is used as the SDP model. The proposed CS-ILDM is applied to five publicly available data sets from the NASA Metrics Data Program repository, and its performance is compared with that of other existing SDP models. The experimental results confirm that the proposed CS-ILDM not only achieves good prediction performance but also reduces the misprediction cost and avoids the impact of class imbalance.
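To make the role of the distance metric concrete, the following is a minimal Python sketch and not the paper's implementation. It assumes the common Mahalanobis-type form, in which the distance between two samples is \(\sqrt{(s_{i} - s_{j})^{T} M (s_{i} - s_{j})}\) for a positive semi-definite matrix \(M\), built here as \(M = Q^{T}Q\). The function names and the \(n_{\max}/n_{i}\) cost rule are hypothetical illustrations; the abstract does not state the exact DML objective or cost definition used by CS-ILDM.

```python
import math


def psd_from_factor(Q):
    """Return M = Q^T Q, which is positive semi-definite for any real Q."""
    rows, cols = len(Q), len(Q[0])
    return [[sum(Q[k][i] * Q[k][j] for k in range(rows)) for j in range(cols)]
            for i in range(cols)]


def metric_distance(s_i, s_j, M):
    """Mahalanobis-type distance sqrt((s_i - s_j)^T M (s_i - s_j))."""
    diff = [a - b for a, b in zip(s_i, s_j)]
    d = len(diff)
    quad = sum(diff[p] * M[p][q] * diff[q] for p in range(d) for q in range(d))
    # Clamp tiny negative values caused by floating-point rounding.
    return math.sqrt(max(quad, 0.0))


def class_costs(labels):
    """Hypothetical cost rule (an assumption): cost of class i is n_max / n_i,
    so the minority class receives the larger misprediction cost."""
    counts = {}
    for c in labels:
        counts[c] = counts.get(c, 0) + 1
    n_max = max(counts.values())
    return {c: n_max / n for c, n in counts.items()}
```

With the identity matrix as \(M\), `metric_distance` reduces to the Euclidean distance. A learned \(M\) stretches or shrinks individual attribute directions, which is how a metric can change which samples count as close.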



Abbreviations

Tr: Training sample set
S: Sample set
Γ: Label vector
Δ: DML distance between two samples
\(s_{i}\): ith training sample
\(c_{i}\): ith class label
\(X^{T}\): X's transpose
\(X^{-1}\): X's inverse
\(R^{d}\): d-dimensional Euclidean space
t: Number of training samples
m: Number of class labels
Q: A real-valued matrix
M: A \(d \times d\) positive semi-definite matrix
Y: An \(m \times m\) binary matrix
η: A constant
\(n_{i}\): Number of samples with label \(c_{i}\)
\(l(i)\): Importance of \(c_{i}\)
\(\overline{m}\): Margin mean
\(\hat{m}\): Margin variance
\(\eta_{1}, \eta_{2}\): Trade-off parameters
\(\eta_{3}\): Penalty parameter
\(\varepsilon_{i}\): Misprediction cost of ith training sample
\(n_{\max}\): Number of samples in the majority class
\(n_{\min}\): Number of samples in the minority class
\(\kappa\): A nonnegative parameter
\(\tilde{\overline{m}}\): Cost-sensitive margin mean
\(\tilde{\hat{m}}\): Cost-sensitive margin variance
λ: Coefficient vector
μ, ν: Lagrangian multiplier vectors
S(i): ith standardized attribute vector
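The margin mean and margin variance listed above can be illustrated with a short Python sketch. It assumes a linear decision function with weight vector `w`, labels in {+1, -1}, and the margin of a sample defined as its label times its decision value, as is usual for large margin distribution machines. The optional `costs` argument gives a cost-weighted variant; that weighting scheme and both function names are assumptions made for illustration, and the normalization may differ from the paper's definitions of \(\tilde{\overline{m}}\) and \(\tilde{\hat{m}}\).

```python
def margins(w, samples, labels):
    """Margin of each sample: label * (w . x), with labels in {+1, -1}."""
    return [y * sum(wk * xk for wk, xk in zip(w, x))
            for x, y in zip(samples, labels)]


def margin_mean_variance(w, samples, labels, costs=None):
    """Return (mean, variance) of the margins.

    If `costs` is given, each sample is weighted by its cost
    (a hypothetical cost-sensitive variant, not taken from the paper).
    """
    g = margins(w, samples, labels)
    if costs is None:
        costs = [1.0] * len(g)
    total = sum(costs)
    mean = sum(c * gi for c, gi in zip(costs, g)) / total
    variance = sum(c * (gi - mean) ** 2 for c, gi in zip(costs, g)) / total
    return mean, variance
```

A large margin distribution machine favors a large margin mean and a small margin variance at the same time, rather than maximizing only the minimum margin as a standard support vector machine does.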

Author information

Corresponding author

Correspondence to Cong Jin.

Ethics declarations

Conflict of interest

The author declares that they have no conflict of interest.

Additional information

Communicated by V. Loia.

About this article

Cite this article

Jin, C. Software defect prediction model based on distance metric learning. Soft Comput 25, 447–461 (2021). https://doi.org/10.1007/s00500-020-05159-1

  • DOI: https://doi.org/10.1007/s00500-020-05159-1
