
MHCPDP: multi-source heterogeneous cross-project defect prediction via multi-source transfer learning and autoencoder

Published in: Software Quality Journal

Abstract

Heterogeneous cross-project defect prediction (HCPDP) aims to build a defect prediction model for a target project by reusing datasets from source projects whose features differ from those of the target project. Most existing HCPDP methods only remove redundant or unrelated features without exploring the underlying features of cross-project datasets. Moreover, when transfer learning is applied in HCPDP, these methods ignore the negative effects of transfer. In this paper, we propose a novel HCPDP method called multi-source heterogeneous cross-project defect prediction (MHCPDP). To reduce the gap between the target and source datasets, MHCPDP uses an autoencoder to extract intermediate features from the original datasets instead of simply removing redundant and unrelated features, and adopts a modified autoencoder algorithm to perform instance selection, eliminating irrelevant instances from the source-domain datasets. Furthermore, by incorporating multiple source projects to increase the number of source datasets, MHCPDP develops a multi-source transfer learning algorithm to reduce the impact of negative transfer and improve classifier performance. We comprehensively evaluate MHCPDP on five open-source datasets; the experimental results show that MHCPDP not only achieves significant improvements on two performance metrics but also overcomes the shortcomings of conventional HCPDP methods.
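To make the pipeline concrete, the sketch below (in PyTorch) illustrates the general idea rather than the authors' implementation: each project gets its own autoencoder that compresses its heterogeneous metric set into intermediate features of a common size, reconstruction error filters out source instances the autoencoder cannot represent well, and per-source classifiers are combined on the target project. The network sizes, keep_ratio, logistic-regression base learner, and uniform ensemble weights are illustrative assumptions; MHCPDP's actual feature alignment and negative-transfer weighting scheme are not reproduced here.

```python
# Minimal sketch (not the authors' code): per-project autoencoders for
# intermediate feature extraction, reconstruction-error-based instance
# selection, and a naive uniform-weight multi-source ensemble.
import numpy as np
import torch
import torch.nn as nn
from sklearn.linear_model import LogisticRegression


class Autoencoder(nn.Module):
    """Encoder compresses one project's metrics into `latent_dim`
    intermediate features; decoder reconstructs the original metrics."""

    def __init__(self, in_dim: int, latent_dim: int = 16):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(),
                                     nn.Linear(64, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(),
                                     nn.Linear(64, in_dim))

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z


def fit_autoencoder(X, latent_dim=16, epochs=200, lr=1e-3):
    """Fit an autoencoder on a single project's metric matrix (unsupervised)."""
    x = torch.tensor(X, dtype=torch.float32)
    model = Autoencoder(X.shape[1], latent_dim)
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        x_hat, _ = model(x)
        loss = nn.functional.mse_loss(x_hat, x)
        loss.backward()
        opt.step()
    return model


def encode(model, X):
    """Intermediate features used in place of the raw, heterogeneous metrics."""
    with torch.no_grad():
        _, z = model(torch.tensor(X, dtype=torch.float32))
    return z.numpy()


def select_instances(model, X, keep_ratio=0.8):
    """Keep the source instances with the smallest reconstruction error,
    treating poorly reconstructed rows as irrelevant to the latent space."""
    with torch.no_grad():
        x = torch.tensor(X, dtype=torch.float32)
        x_hat, _ = model(x)
        err = ((x_hat - x) ** 2).mean(dim=1).numpy()
    return np.argsort(err)[: int(keep_ratio * len(X))]


def predict(sources, X_target, latent_dim=16):
    """`sources` is a list of (X, y) numpy pairs from different projects.
    Each source gets its own autoencoder and classifier; predictions are
    averaged with uniform weights, whereas the paper learns weights to
    suppress negative transfer and aligns the latent representations."""
    target_ae = fit_autoencoder(X_target, latent_dim)
    z_target = encode(target_ae, X_target)
    probs = []
    for X_s, y_s in sources:
        ae = fit_autoencoder(X_s, latent_dim)
        keep = select_instances(ae, X_s)
        clf = LogisticRegression(max_iter=1000)
        clf.fit(encode(ae, X_s)[keep], y_s[keep])
        probs.append(clf.predict_proba(z_target)[:, 1])
    return (np.mean(probs, axis=0) >= 0.5).astype(int)
```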




Funding

This work was supported in part by National Key Research and Development Project under grant 2019YFB1706101 and in part by the Science-Technology Foundation of Chongqing, China, under grant cstc2019jscx-mbdx0083.

Author information


Corresponding author

Correspondence to Yingbo Wu.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

The datasets and the source code of the proposed approach used to conduct this study are available at https://github.com/SE-CQU/sdp.


About this article


Cite this article

Wu, J., Wu, Y., Niu, N. et al. MHCPDP: multi-source heterogeneous cross-project defect prediction via multi-source transfer learning and autoencoder. Software Quality Journal 29, 405–430 (2021). https://doi.org/10.1007/s11219-021-09553-2

