Abstract
This paper proposes a new software-defect prediction method based on self-organizing data mining, which can establish a causal relationship between software metrics and defects. Defect-prediction models are built for both intra-project and cross-project scenarios. For intra-project prediction, a self-organizing data mining model is established, and a smooth data-preprocessing step is added to address data imbalance. For cross-project prediction, a self-organizing data mining model is also established; the difference between the source and target projects is handled by selecting source-project instances that have a larger correlation coefficient with the target project, and the defect-prediction model is built on the selected source-project instances. The method targets both classification and ranking prediction and is tested on public defect datasets. In the classification-prediction experiments, precision, F-measure, and AUC are used as evaluation indicators. In the ranking-prediction experiments, the AAE and ARE values obtained by the method are optimal. The results indicate that the algorithm is an efficient and feasible method for software-defect prediction.
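The cross-project step described in the abstract (keeping source-project instances that correlate more strongly with the target project) can be illustrated with a minimal sketch. The abstract does not say how the correlation is computed or how many instances are kept, so everything below is an assumption for illustration only: the Pearson correlation is taken between metric vectors (rows), each source instance is scored by its mean correlation with all target instances, and the function names `row_correlations` and `select_source_instances` and the parameter `n_keep` are hypothetical.

```python
import numpy as np


def row_correlations(source, target):
    """Pearson correlation between every source row and every target row.

    Assumption: each row is one module's metric vector, and the correlation
    is computed across the metric dimensions.
    """
    s = source - source.mean(axis=1, keepdims=True)
    t = target - target.mean(axis=1, keepdims=True)
    s_norm = np.linalg.norm(s, axis=1, keepdims=True)
    t_norm = np.linalg.norm(t, axis=1, keepdims=True)
    # Constant rows have zero norm; use 1.0 so their correlation becomes 0.
    s_norm[s_norm == 0] = 1.0
    t_norm[t_norm == 0] = 1.0
    return (s / s_norm) @ (t / t_norm).T


def select_source_instances(source, target, n_keep):
    """Return sorted indices of the n_keep source rows that have the highest
    mean correlation with the target rows (hypothetical selection rule)."""
    score = row_correlations(source, target).mean(axis=1)
    order = np.argsort(-score, kind="stable")
    return np.sort(order[:n_keep])


if __name__ == "__main__":
    # Toy metric matrices, invented purely for illustration.
    source = np.array([[1.0, 2.0, 3.0],
                       [3.0, 2.0, 1.0],
                       [2.0, 4.0, 6.0]])
    target = np.array([[10.0, 20.0, 30.0],
                       [1.0, 2.0, 4.0]])
    print(select_source_instances(source, target, n_keep=2))
```

In this toy data, the second source row trends in the opposite direction to both target rows, so it is the one left out. A defect-prediction model would then be trained only on the selected rows; a threshold on the score could be used instead of a fixed `n_keep`.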
Acknowledgements
This work was supported by the National Engineering Laboratory Project for the Safety Technology of Urban Rail Transit System (Development and Reform Office High Technology [2016] No. 583).
Cite this article
Zhang, Q., Ren, J. Software-defect prediction within and across projects based on improved self-organizing data mining. J Supercomput 78, 6147–6173 (2022). https://doi.org/10.1007/s11227-021-04113-8
DOI: https://doi.org/10.1007/s11227-021-04113-8