Abstract
Software defect prediction (SDP) is an important means of analyzing software quality and reducing development costs. Data collected during the software lifecycle can be used to predict software defects. Many SDP models have been proposed; however, their performance is not always ideal. In many existing prediction models based on machine learning, the distance metric between samples has a significant impact on the performance of the SDP model. In addition, the samples are usually class imbalanced. To address these issues, this paper proposes a novel distance metric learning method based on cost-sensitive learning (CSL) that reduces the impact of class imbalance; the learned metric is then applied to the large margin distribution machine (LDM) in place of the traditional kernel function. Further, the improvement and optimization of LDM based on CSL are studied, and the improved LDM, called CS-ILDM, is used as the SDP model. The proposed CS-ILDM is then applied to five publicly available data sets from the NASA Metrics Data Program repository, and its performance is compared with that of other existing SDP models. The experimental results confirm that the proposed CS-ILDM not only has good prediction performance but also reduces the misprediction cost and avoids the impact of class imbalance of samples.
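To make the idea of a learned distance metric between samples concrete, the following minimal sketch computes a Mahalanobis-type distance \(\sqrt{(x-y)^{T} M (x-y)}\). It is an illustration only, not the CS-ILDM procedure: the factorization \(M = Q^{T}Q\) (a standard way to guarantee a positive semi-definite \(M\)), the function name, and the numeric values of `Q_learned` are assumptions introduced for this example.

```python
import math


def mahalanobis_distance(x, y, Q):
    """Return sqrt((x - y)^T M (x - y)) with M = Q^T Q.

    Assumption for illustration: M is parameterized as Q^T Q, which makes it
    positive semi-definite. The distance then equals the Euclidean norm of
    Q (x - y).
    """
    diff = [a - b for a, b in zip(x, y)]
    # Project the difference vector: z = Q @ diff.
    z = [sum(q_ij * d_j for q_ij, d_j in zip(row, diff)) for row in Q]
    return math.sqrt(sum(v * v for v in z))


# With Q equal to the identity, the metric is the ordinary Euclidean distance.
Q_identity = [[1.0, 0.0], [0.0, 1.0]]
# Hypothetical "learned" Q: stretches the first attribute, shrinks the second.
Q_learned = [[2.0, 0.0], [0.0, 0.5]]

a, b = [1.0, 2.0], [4.0, 6.0]
d_euclid = mahalanobis_distance(a, b, Q_identity)
d_learned = mahalanobis_distance(a, b, Q_learned)
```

A metric learning method chooses \(Q\) (equivalently \(M\)) from the training data so that samples with the same label end up close and samples with different labels end up far apart; how the cost-sensitive variant does this is described in the body of the paper, not in this sketch.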
Abbreviations
- Tr: Training sample set
- S: Sample set
- Γ: Label vector
- Δ: DML distance between two samples
- \(s_{i}\): \(i\)th training sample
- \(c_{i}\): \(i\)th class label
- \(X^{T}\): X's transpose
- \(X^{-1}\): X's inverse
- \(R^{d}\): d-dimensional Euclidean space
- t: Number of training samples
- m: Number of class labels
- Q: A real-valued matrix
- M: A \(d \times d\) positive semi-definite matrix
- Y: An \(m \times m\) binary matrix
- η: A constant
- \(n_{i}\): Number of samples with label \(c_{i}\)
- \(l(i)\): Importance of \(c_{i}\)
- \(\overline{m}\): Margin mean
- \(\hat{m}\): Margin variance
- \(\eta_{1}, \eta_{2}\): Trade-off parameters
- \(\eta_{3}\): Penalty parameter
- \(\varepsilon_{i}\): Misprediction cost of the \(i\)th training sample
- \(n_{\max}\): Number of samples in the majority class
- \(n_{\min}\): Number of samples in the minority class
- \(\kappa\): A nonnegative parameter
- \(\tilde{\overline{m}}\): Cost-sensitive margin mean
- \(\tilde{\hat{m}}\): Cost-sensitive margin variance
- λ: Coefficient vector
- μ, ν: Lagrangian multiplier vectors
- S(i): \(i\)th standardized attribute vector
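The margin mean \(\overline{m}\) and margin variance \(\hat{m}\) listed above are the two statistics that a large margin distribution machine trades off. The sketch below computes them for a plain linear model, assuming binary labels in \(\{-1, +1\}\) and the population variance; normalization conventions vary between papers, and the function names, the weight vector `w`, and the toy data are hypothetical. The cost-sensitive variants \(\tilde{\overline{m}}\) and \(\tilde{\hat{m}}\) are not reproduced here because their definitions are not given in this section.

```python
def margins(w, samples, labels):
    """Margin of each sample under a linear model: y_i * <w, s_i>.

    Assumption: labels are -1 or +1, so a positive margin means the sample
    lies on the correct side of the decision boundary.
    """
    return [
        y * sum(w_j * s_j for w_j, s_j in zip(w, s))
        for s, y in zip(samples, labels)
    ]


def margin_mean(gammas):
    """Average margin over all samples."""
    return sum(gammas) / len(gammas)


def margin_variance(gammas):
    """Population variance of the margins (one possible normalization)."""
    mean = margin_mean(gammas)
    return sum((g - mean) ** 2 for g in gammas) / len(gammas)


# Hypothetical weight vector and toy training set with two attributes.
w = [1.0, -1.0]
samples = [[2.0, 0.0], [0.0, 1.0], [1.0, 1.0], [3.0, 1.0]]
labels = [1, -1, -1, 1]

g = margins(w, samples, labels)
m_bar = margin_mean(g)
m_hat = margin_variance(g)
```

A large margin distribution machine prefers weight vectors with a large margin mean and a small margin variance, with trade-off parameters such as \(\eta_{1}\) and \(\eta_{2}\) balancing the two terms against the usual regularization and loss.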
Ethics declarations
Conflict of interest
The author declares that they have no conflict of interest.
Additional information
Communicated by V. Loia.
Cite this article
Jin, C. Software defect prediction model based on distance metric learning. Soft Comput 25, 447–461 (2021). https://doi.org/10.1007/s00500-020-05159-1
DOI: https://doi.org/10.1007/s00500-020-05159-1