Heterogeneous defect prediction with two-stage ensemble learning

Li, Zhiqiang; Jing, Xiao-Yuan; Zhu, Xiaoke; Zhang, Hongyu; Xu, Baowen; Ying, Shi

doi:10.1007/s10515-019-00259-1

Published: 04 June 2019

Automated Software Engineering, Volume 26, pages 599–651 (2019)

Zhiqiang Li (ORCID: orcid.org/0000-0001-5999-3658)^1,2,
Xiao-Yuan Jing^2,3,
Xiaoke Zhu^4,
Hongyu Zhang^5,
Baowen Xu^6 &
Shi Ying^2

Abstract

Heterogeneous defect prediction (HDP) refers to predicting defect-prone software modules in one project (the target) using heterogeneous data collected from other projects (the sources). Several HDP methods have been proposed recently. However, these methods do not sufficiently account for two characteristics of defect data: (1) the data may be linearly inseparable, and (2) the data may be highly imbalanced. These two characteristics make it challenging to build an effective HDP model. In this paper, we propose a novel Two-Stage Ensemble Learning (TSEL) approach to HDP, which consists of an ensemble multi-kernel domain adaptation (EMDA) stage and an ensemble data sampling (EDS) stage. In the EMDA stage, we develop an Ensemble Multiple Kernel Correlation Alignment (EMKCA) predictor, which combines the advantages of multiple kernel learning and domain adaptation. In the EDS stage, we employ the RESample with replacement (RES) technique to learn multiple different EMKCA predictors and combine them with an average ensemble. Together, the two stages create an ensemble of defect predictors. Extensive experiments on 30 public projects show that the proposed TSEL approach outperforms a range of competing methods. The improvements are 20.14–33.92% in AUC, 36.05–54.78% in F-measure, and 5.48–19.93% in balance.
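
The EDS idea described in the abstract (resample with replacement, train several predictors, average their outputs) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The base learner is a hypothetical nearest-centroid scorer standing in for the EMKCA predictor, the resampled sets are assumed to be class-balanced, and every function name and the toy data are made up for illustration. The `balance` helper assumes the definition commonly used in the defect prediction literature, based on the probability of detection (pd) and the probability of false alarm (pf); the abstract itself does not spell out the formula.

```python
import math
import random


def resample_balanced(X, y, rng):
    """Draw a class-balanced sample WITH replacement (assumed form of RES)."""
    pos = [i for i, label in enumerate(y) if label == 1]
    neg = [i for i, label in enumerate(y) if label == 0]
    n = max(len(pos), len(neg))
    idx = [rng.choice(pos) for _ in range(n)] + [rng.choice(neg) for _ in range(n)]
    return [X[i] for i in idx], [y[i] for i in idx]


def fit_centroid_scorer(X, y):
    """Hypothetical stand-in base learner (NOT EMKCA): nearest-centroid scorer."""
    def centroid(label):
        rows = [x for x, t in zip(X, y) if t == label]
        return [sum(col) / len(rows) for col in zip(*rows)]

    c_pos, c_neg = centroid(1), centroid(0)

    def score(x):
        # Score in [0, 1]; closer to the defective centroid gives a higher score.
        d_pos, d_neg = math.dist(x, c_pos), math.dist(x, c_neg)
        total = d_pos + d_neg
        return 0.5 if total == 0 else d_neg / total

    return score


def fit_average_ensemble(X, y, n_models=10, seed=0):
    """Train one scorer per resampled set and average their scores."""
    rng = random.Random(seed)
    models = [fit_centroid_scorer(*resample_balanced(X, y, rng))
              for _ in range(n_models)]

    def predict_score(x):
        return sum(m(x) for m in models) / len(models)

    return predict_score


def balance(pd, pf):
    """Balance: 1 minus the normalized distance from the ideal point (pf=0, pd=1)."""
    return 1 - math.sqrt((0 - pf) ** 2 + (1 - pd) ** 2) / math.sqrt(2)


# Toy imbalanced data: 2 defective modules (label 1), 8 clean modules (label 0).
X = [[0.9, 0.8], [1.0, 0.9], [0.1, 0.2], [0.2, 0.1], [0.0, 0.1],
     [0.1, 0.0], [0.2, 0.2], [0.15, 0.1], [0.05, 0.2], [0.1, 0.1]]
y = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

predict_score = fit_average_ensemble(X, y)
defective_like = predict_score([0.95, 0.85])  # near the defective modules
clean_like = predict_score([0.1, 0.1])        # near the clean modules
```

Averaging scores rather than taking a majority vote keeps a continuous output, which is what a threshold-free measure such as AUC needs. Swapping in a kernel-based, domain-adapted learner for `fit_centroid_scorer` would not change the resampling or averaging logic.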

Notes

http://openscience.us/repo/.
http://www.cse.ust.hk/~scc/ReLink.htm.
http://bug.inf.usi.ch/.
Code is available at https://sites.google.com/site/tselhdp/.
Code is available at https://sites.google.com/site/cstkcca/.
https://en.wikipedia.org/wiki/Mann-Whitney_U_test.
https://CRAN.R-project.org/package=ScottKnottESD.
Code is available at https://sites.google.com/site/enmkca/.
Version R2014a, http://mathworks.com/help/stats/index.html.

Acknowledgements

The authors would like to thank the editors and anonymous reviewers for their constructive comments and suggestions. This work was supported by the NSFC-Key Project of General Technology Fundamental Research United Fund under Grant No. U1736211, the National Key Research and Development Program of China under Grant No. 2017YFB0202001, the National Natural Science Foundation of China under Grant Nos. 61672208 and 41571417, the Fundamental Research Funds for the Central Universities under Grant No. GK201903086, the Natural Science Foundation Key Project for Innovation Group of Hubei Province under Grant No. 2018CFA024, the Science and Technique Development Program of Henan under Grant Nos. 172102210186 and 182102311066, the Higher Education Institution Key Research Projects of Henan Province under Grant No. 19A520001, the Medical Education Research Project of Henan under Grant No. Wjlx2016095, and the Scientific Research Starting Foundation of SNNU under Grant No. 1110011006.

Author information

Authors and Affiliations

1. School of Computer Science, Shaanxi Normal University, Xi’an, 710119, China: Zhiqiang Li
2. School of Computer Science, Wuhan University, Wuhan, 430072, China: Zhiqiang Li, Xiao-Yuan Jing & Shi Ying
3. School of Automation, Nanjing University of Posts and Telecommunications, Nanjing, 210023, China: Xiao-Yuan Jing
4. School of Computer and Information Engineering, Henan University, Kaifeng, 475001, China: Xiaoke Zhu
5. School of Electrical Engineering and Computing, The University of Newcastle, Callaghan, NSW, 2308, Australia: Hongyu Zhang
6. State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, 210093, China: Baowen Xu

Corresponding authors

Correspondence to Xiao-Yuan Jing or Xiaoke Zhu.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Li, Z., Jing, XY., Zhu, X. et al. Heterogeneous defect prediction with two-stage ensemble learning. Autom Softw Eng 26, 599–651 (2019). https://doi.org/10.1007/s10515-019-00259-1

Received: 27 May 2018
Accepted: 20 May 2019
Published: 04 June 2019
Issue Date: 15 September 2019
DOI: https://doi.org/10.1007/s10515-019-00259-1
