Heterogeneous defect prediction with two-stage ensemble learning

Automated Software Engineering

Abstract

Heterogeneous defect prediction (HDP) refers to predicting defect-prone software modules in one project (the target) using heterogeneous data collected from other projects (the source). Several HDP methods have been proposed recently. However, these methods do not sufficiently account for two characteristics of defect data: (1) the data may be linearly inseparable, and (2) the data may be highly imbalanced. These two characteristics make it challenging to build an effective HDP model. In this paper, we propose a novel Two-Stage Ensemble Learning (TSEL) approach to HDP, which contains two stages: an ensemble multi-kernel domain adaptation (EMDA) stage and an ensemble data sampling (EDS) stage. In the EMDA stage, we develop an Ensemble Multiple Kernel Correlation Alignment (EMKCA) predictor, which combines the advantages of multiple kernel learning and domain adaptation techniques. In the EDS stage, we employ the RESample with replacement (RES) technique to learn multiple different EMKCA predictors and combine them with an average ensemble. Together, these two stages create an ensemble of defect predictors. Extensive experiments on 30 public projects show that the proposed TSEL approach outperforms a range of competing methods. The improvement is 20.14–33.92% in AUC, 36.05–54.78% in f-measure, and 5.48–19.93% in balance.
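
The EDS idea of resampling with replacement and averaging several predictors can be illustrated with a minimal Python sketch. It is not the authors' implementation: the base learner here is a simple nearest-centroid scorer standing in for the EMKCA predictor, the resampling is assumed to draw each class with replacement up to the size of the larger class, and all function names (`resample_balanced`, `fit_centroid_scorer`, `fit_average_ensemble`) are hypothetical.

```python
import random
from statistics import mean


def resample_balanced(X, y, rng):
    """Draw a class-balanced sample with replacement (assumed form of RES).

    Both classes (0 = clean, 1 = defective) must be present in y.
    """
    by_class = {0: [], 1: []}
    for xi, yi in zip(X, y):
        by_class[yi].append(xi)
    n = max(len(rows) for rows in by_class.values())
    Xs, ys = [], []
    for label, rows in by_class.items():
        for _ in range(n):
            Xs.append(rng.choice(rows))  # sampling with replacement
            ys.append(label)
    return Xs, ys


def fit_centroid_scorer(X, y):
    """Stand-in base predictor (NOT EMKCA): score by distance to class centroids."""
    def centroid(label):
        rows = [xi for xi, yi in zip(X, y) if yi == label]
        return [mean(col) for col in zip(*rows)]

    c0, c1 = centroid(0), centroid(1)

    def score(x):
        d0 = sum((a - b) ** 2 for a, b in zip(x, c0)) ** 0.5
        d1 = sum((a - b) ** 2 for a, b in zip(x, c1)) ** 0.5
        # Closer to the defective centroid gives a score nearer to 1.
        return d0 / (d0 + d1) if d0 + d1 > 0 else 0.5

    return score


def fit_average_ensemble(X, y, n_members=10, seed=0):
    """Train one base predictor per resampled set and average their scores."""
    rng = random.Random(seed)
    members = [
        fit_centroid_scorer(*resample_balanced(X, y, rng))
        for _ in range(n_members)
    ]

    def predict_score(x):
        return mean(m(x) for m in members)

    return predict_score


# Toy, made-up data: five clean modules and one defective module.
X = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.5, 0.5], [5.0, 5.0]]
y = [0, 0, 0, 0, 0, 1]
model = fit_average_ensemble(X, y)
```

Calling `model(x)` returns the averaged defect-proneness score of a module `x`; thresholding at 0.5 is one possible decision rule. Because each member sees a different balanced sample, the average is less sensitive to the minority class being drawn unluckily in any single sample.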

Notes

  1. http://openscience.us/repo/.

  2. http://www.cse.ust.hk/~scc/ReLink.htm.

  3. http://bug.inf.usi.ch/.

  4. Code is available at https://sites.google.com/site/tselhdp/.

  5. Code is available at https://sites.google.com/site/cstkcca/.

  6. https://en.wikipedia.org/wiki/Mann-Whitney_U_test.

  7. https://CRAN.R-project.org/package=ScottKnottESD.

  8. Code is available at https://sites.google.com/site/enmkca/.

  9. Version R2014a, http://mathworks.com/help/stats/index.html.

Acknowledgements

The authors would like to thank the editors and anonymous reviewers for their constructive comments and suggestions. This work was supported by the NSFC-Key Project of General Technology Fundamental Research United Fund under Grant No. U1736211, the National Key Research and Development Program of China under Grant No. 2017YFB0202001, the National Natural Science Foundation of China under Grant Nos. 61672208 and 41571417, the Fundamental Research Funds for the Central Universities under Grant No. GK201903086, the Natural Science Foundation Key Project for Innovation Group of Hubei Province under Grant No. 2018CFA024, the Science and Technique Development Program of Henan under Grant Nos. 172102210186 and 182102311066, the Higher Education Institution Key Research Projects of Henan Province under Grant No. 19A520001, the Medical Education Research Project of Henan under Grant No. Wjlx2016095, and the Scientific Research Starting Foundation of SNNU under Grant No. 1110011006.

Author information

Corresponding authors

Correspondence to Xiao-Yuan Jing or Xiaoke Zhu.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Li, Z., Jing, XY., Zhu, X. et al. Heterogeneous defect prediction with two-stage ensemble learning. Autom Softw Eng 26, 599–651 (2019). https://doi.org/10.1007/s10515-019-00259-1

  • DOI: https://doi.org/10.1007/s10515-019-00259-1