Software defect prediction model based on distance metric learning

Jin, Cong

doi:10.1007/s00500-020-05159-1

Methodologies and Application
Published: 13 July 2020

Volume 25, pages 447–461, (2021)
Soft Computing

Cong Jin¹

947 Accesses
20 Citations

Abstract

Software defect prediction (SDP) is an important means of analyzing software quality and reducing development costs. Data collected during the software lifecycle can be used to predict software defects. Many SDP models have been proposed, but their performance is not always satisfactory. In many existing prediction models based on machine learning, the distance metric between samples has a significant impact on the performance of the SDP model. In addition, the samples are usually class imbalanced. To address these issues, this paper proposes a novel distance metric learning method based on cost-sensitive learning (CSL) that reduces the impact of class imbalance; the learned metric is then applied to the large margin distribution machine (LDM) in place of the traditional kernel function. The improvement and optimization of LDM based on CSL are also studied, and the improved LDM, called CS-ILDM, is used as the SDP model. The proposed CS-ILDM is applied to five publicly available data sets from the NASA Metrics Data Program repository, and its performance is compared with that of other existing SDP models. The experimental results confirm that the proposed CS-ILDM not only achieves good prediction performance but also reduces the misprediction cost and avoids the impact of class imbalance.
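The abstract does not state the form of the learned distance. As a rough illustration only, the sketch below uses the Mahalanobis-type form common in distance metric learning, \(\sqrt{(x - y)^{T} M (x - y)}\), with a positive semi-definite matrix M built as \(Q^{T} Q\) from a real-valued matrix Q. The function names `psd_from_factor` and `metric_distance` and the sample values are hypothetical and are not taken from the paper.

```python
import numpy as np


def psd_from_factor(Q):
    """Return M = Q^T Q, which is positive semi-definite for any real Q."""
    Q = np.asarray(Q, dtype=float)
    return Q.T @ Q


def metric_distance(x, y, M):
    """Mahalanobis-type distance sqrt((x - y)^T M (x - y)).

    Assumed form for illustration; the paper's learned metric may differ.
    """
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    squared = float(diff @ M @ diff)
    # Clamp tiny negative values caused by floating-point round-off.
    return float(np.sqrt(max(squared, 0.0)))


# Hypothetical 2-dimensional example: Q stretches the first attribute.
Q = np.array([[2.0, 0.0],
              [0.0, 1.0]])
M = psd_from_factor(Q)

d_learned = metric_distance([1.0, 1.0], [0.0, 0.0], M)
d_euclid = metric_distance([3.0, 4.0], [0.0, 0.0], np.eye(2))
```

With M set to the identity matrix the function reduces to the ordinary Euclidean distance, which makes it easy to see how a learned M reweights the software metrics (attributes) before samples are compared.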

Abbreviations

Tr :: Training sample set
S :: Sample set
Γ :: Label vector
Δ :: DML distance between two samples
\(s_{i}\) :: ith training sample
\(c_{i}\) :: ith class label
\(X^{T}\) :: X’s transpose
\(X^{ - 1}\) :: X’s inverse
\(R^{d}\) :: d-dimensional Euclidean space
t :: Number of training samples
m :: Number of class labels
Q :: A real-valued matrix
M :: A \(d \times d\) positive semi-definite matrix
Y :: An \(m \times m\) binary matrix
η :: A constant
\(n_{i}\) :: Number of samples with label \(c_{i}\)
\(l(i)\) :: Importance of \(c_{i}\)
\(\overline{m}\) :: Margin mean
\(\hat{m}\) :: Margin variance
\(\eta_{1} \,,\,\eta_{2}\) :: Trade-off parameters
\(\eta_{3}\) :: Penalty parameter
\(\varepsilon_{i}\) :: Misprediction cost of ith training sample
\(n_{\max }\) :: Number of samples in the majority class
\(n_{\min }\) :: Number of samples in the minority class
\(\kappa\) :: A nonnegative parameter
\(\tilde{\overline{m}}\) :: Cost-sensitive margin mean
\(\tilde{\hat{m}}\) :: Cost-sensitive margin variance
λ :: Coefficient vector
μ, ν :: Lagrangian multiplier vectors
\(S_{(i)}\) :: ith standardized attribute vector
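The list above names a margin mean \(\overline{m}\), a margin variance \(\hat{m}\), and cost-sensitive counterparts \(\tilde{\overline{m}}\) and \(\tilde{\hat{m}}\), but this page does not give their formulas. The sketch below is therefore only an assumption: it takes the margin of a sample to be its label times its decision score, as is usual for large margin distribution machines, and forms cost-sensitive statistics as cost-weighted averages. The weighting rule (majority-class count divided by the sample's own class count) and the function name `margin_stats` are hypothetical choices for illustration, not the paper's definitions.

```python
import numpy as np


def margin_stats(scores, labels, costs=None):
    """Return (mean, variance) of the margins label_i * score_i.

    If `costs` is given, each margin is weighted by its normalized cost.
    This weighting is an illustrative assumption, not the paper's formula.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=float)
    margins = labels * scores
    if costs is None:
        weights = np.full(margins.shape, 1.0 / margins.size)
    else:
        costs = np.asarray(costs, dtype=float)
        weights = costs / costs.sum()
    mean = float(np.sum(weights * margins))
    variance = float(np.sum(weights * (margins - mean) ** 2))
    return mean, variance


# Hypothetical imbalanced data: one defective sample (+1), three clean (-1).
labels = np.array([1, -1, -1, -1])
scores = np.array([-0.5, -1.0, -2.0, -1.0])  # the defective sample is missed

# Hypothetical cost rule: n_max / n_i for the class of each sample.
class_counts = {1: 1, -1: 3}
n_max = max(class_counts.values())
costs = np.array([n_max / class_counts[int(c)] for c in labels])

plain_mean, plain_var = margin_stats(scores, labels)
cs_mean, cs_var = margin_stats(scores, labels, costs)
```

In this toy case the cost-weighted mean is lower than the unweighted one because the misclassified minority sample counts three times as much, which is the kind of effect a cost-sensitive margin statistic is meant to capture.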

Author information

Authors and Affiliations

School of Computer, Central China Normal University, Wuhan, 430079, People’s Republic of China
Cong Jin

Corresponding author

Correspondence to Cong Jin.

Ethics declarations

Conflict of interest

The author declares that they have no conflict of interest.

Additional information

Communicated by V. Loia.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Jin, C. Software defect prediction model based on distance metric learning. Soft Comput 25, 447–461 (2021). https://doi.org/10.1007/s00500-020-05159-1

Published: 13 July 2020
Issue Date: January 2021
DOI: https://doi.org/10.1007/s00500-020-05159-1
