Abstract
Heterogeneous defect prediction (HDP) refers to predicting defect-prone software modules in one project (the target) using heterogeneous data collected from other projects (the source). Several HDP methods have been proposed recently. However, these methods do not sufficiently account for two characteristics of defect data: (1) the data may be linearly inseparable, and (2) the data may be highly imbalanced. These two characteristics make it challenging to build an effective HDP model. In this paper, we propose a novel Two-Stage Ensemble Learning (TSEL) approach to HDP, which consists of two stages: an ensemble multi-kernel domain adaptation (EMDA) stage and an ensemble data sampling (EDS) stage. In the EMDA stage, we develop an Ensemble Multiple Kernel Correlation Alignment (EMKCA) predictor, which combines the advantages of multiple kernel learning and domain adaptation techniques. In the EDS stage, we employ the resampling with replacement (RES) technique to learn multiple different EMKCA predictors and combine them with an average ensemble. Together, these two stages create an ensemble of defect predictors. Extensive experiments on 30 public projects show that the proposed TSEL approach outperforms a range of competing methods. The improvements range from 20.14% to 33.92% in AUC, from 36.05% to 54.78% in F-measure, and from 5.48% to 19.93% in balance.
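The EDS stage described above (resampling with replacement, then averaging the resulting predictors) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the base learner here is a hypothetical nearest-centroid scorer standing in for the EMKCA predictor, the resample is a plain uniform bootstrap (the abstract does not specify the exact sampling scheme), and all function names and the toy data are assumptions made for illustration.

```python
import random
from statistics import mean


def resample_with_replacement(rows, rng):
    """Draw len(rows) labelled rows uniformly at random, with replacement."""
    return rng.choices(rows, k=len(rows))


def fit_centroid_scorer(rows):
    """Hypothetical stand-in for an EMKCA predictor (assumption, not the paper's model).

    Returns a function mapping a feature vector to a defect score in [0, 1],
    based on relative distance to the clean and defective class centroids.
    """
    pos = [x for x, y in rows if y == 1]
    neg = [x for x, y in rows if y == 0]
    if not pos or not neg:
        # Degenerate resample with a single class: fall back to the class prior.
        prior = len(pos) / len(rows)
        return lambda x: prior
    dim = len(rows[0][0])
    c_pos = [mean(x[j] for x in pos) for j in range(dim)]
    c_neg = [mean(x[j] for x in neg) for j in range(dim)]

    def score(x):
        d_pos = sum((a - b) ** 2 for a, b in zip(x, c_pos)) ** 0.5
        d_neg = sum((a - b) ** 2 for a, b in zip(x, c_neg)) ** 0.5
        total = d_pos + d_neg
        # Closer to the defective centroid gives a score nearer to 1.
        return 0.5 if total == 0 else d_neg / total

    return score


def fit_average_ensemble(rows, n_members=25, seed=0):
    """Train one base predictor per resample and average their scores."""
    rng = random.Random(seed)
    members = [
        fit_centroid_scorer(resample_with_replacement(rows, rng))
        for _ in range(n_members)
    ]
    return lambda x: mean(m(x) for m in members)


# Toy, imbalanced data (illustrative only): 3 defective modules, 9 clean ones.
toy_rows = [
    ((5.0, 5.0), 1), ((6.0, 5.0), 1), ((5.0, 6.0), 1),
    ((0.0, 0.0), 0), ((1.0, 0.0), 0), ((0.0, 1.0), 0),
    ((1.0, 1.0), 0), ((0.5, 0.5), 0), ((0.2, 0.8), 0),
    ((0.8, 0.2), 0), ((0.3, 0.3), 0), ((0.7, 0.7), 0),
]
predict = fit_average_ensemble(toy_rows)
defective_score = predict((5.0, 5.0))
clean_score = predict((0.0, 0.0))
```

Averaging scores rather than taking a majority vote keeps the ensemble output continuous, so a threshold-free measure such as AUC can be computed directly from it.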
Notes
Code is available at https://sites.google.com/site/tselhdp/.
Code is available at https://sites.google.com/site/cstkcca/.
Code is available at https://sites.google.com/site/enmkca/.
Version R2014a, http://mathworks.com/help/stats/index.html.
Acknowledgements
The authors would like to thank the editors and anonymous reviewers for their constructive comments and suggestions. This work was supported by the NSFC-Key Project of General Technology Fundamental Research United Fund under Grant No. U1736211, the National Key Research and Development Program of China under Grant No. 2017YFB0202001, the National Natural Science Foundation of China under Grant Nos. 61672208 and 41571417, the Fundamental Research Funds for the Central Universities under Grant No. GK201903086, the Natural Science Foundation Key Project for Innovation Group of Hubei Province under Grant No. 2018CFA024, the Science and Technique Development Program of Henan under Grant Nos. 172102210186 and 182102311066, the Higher Education Institution Key Research Projects of Henan Province under Grant No. 19A520001, the Medical Education Research Project of Henan under Grant No. Wjlx2016095, and the Scientific Research Starting Foundation of SNNU under Grant No. 1110011006.
About this article
Cite this article
Li, Z., Jing, XY., Zhu, X. et al. Heterogeneous defect prediction with two-stage ensemble learning. Autom Softw Eng 26, 599–651 (2019). https://doi.org/10.1007/s10515-019-00259-1
DOI: https://doi.org/10.1007/s10515-019-00259-1