Data sampling and kernel manifold discriminant alignment for mixed-project heterogeneous defect prediction

Niu, Jingwen; Li, Zhiqiang; Chen, Haowen; Dong, Xiwei; Jing, Xiao-Yuan

doi:10.1007/s11219-022-09588-z

Published: 11 April 2022

Volume 30, pages 917–951 (2022)

Software Quality Journal

Jingwen Niu^1,2,
Zhiqiang Li^1 (ORCID: orcid.org/0000-0001-5999-3658),
Haowen Chen^3,
Xiwei Dong^4 &
Xiao-Yuan Jing^3,5,6

Abstract

Heterogeneous defect prediction (HDP) identifies the software modules in a target project that are most likely to be defect-prone, using heterogeneous metric data from other source projects; it thereby addresses the heterogeneous metric problem in cross-project defect prediction. Recently, several mixed-project HDP methods have been presented. However, these methods do not address the linear inseparability and cross-project class imbalance issues simultaneously, which usually leads to unsatisfactory HDP performance. In this paper, we propose an improved transfer learning approach for mixed-project HDP, called data sampling and kernel manifold discriminant alignment (DSKMDA), to deal with these limitations. DSKMDA first applies a data sampling technique to handle the class imbalance issue. It then uses a kernel manifold discriminant alignment technique to handle the linear inseparability issue. Extensive experiments on 13 projects from three public benchmark datasets with four evaluation measures demonstrate that DSKMDA produces better or comparable results compared with a range of competing methods.
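
The two stages named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say which sampling technique or kernel DSKMDA uses, so the random oversampling, the Gaussian (RBF) kernel, the `gamma` heuristic, and the synthetic data below are all assumptions chosen for illustration. The manifold discriminant alignment step itself is not shown.

```python
import numpy as np

def random_oversample(X, y, rng):
    """Duplicate minority-class rows until both classes have equal size.

    Assumption: plain random oversampling stands in for the unspecified
    data sampling technique.
    """
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    deficit = counts.max() - counts.min()
    minority_idx = np.flatnonzero(y == minority)
    extra = rng.choice(minority_idx, size=deficit, replace=True)
    keep = np.concatenate([np.arange(len(y)), extra])
    return X[keep], y[keep]

def rbf_kernel(A, B, gamma):
    """Gaussian kernel matrix K[i, j] = exp(-gamma * ||A[i] - B[j]||^2)."""
    sq = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    # Clip tiny negative values caused by floating-point rounding.
    return np.exp(-gamma * np.maximum(sq, 0.0))

rng = np.random.default_rng(0)

# Hypothetical source project: 20 modules, 6 metrics, 4 defective modules.
X_src = rng.normal(size=(20, 6))
y_src = np.array([1] * 4 + [0] * 16)

# Hypothetical target project: 10 modules described by 9 different metrics.
X_tgt = rng.normal(size=(10, 9))

# Stage 1: rebalance the source classes.
X_bal, y_bal = random_oversample(X_src, y_src, rng)

# Stage 2: map each project into its own kernel space, so that a later
# alignment step can work on nonlinear representations.
K_src = rbf_kernel(X_bal, X_bal, gamma=1.0 / X_bal.shape[1])
K_tgt = rbf_kernel(X_tgt, X_tgt, gamma=1.0 / X_tgt.shape[1])
```

Each project is kernelized separately because the source and target metric sets differ in number and meaning, so their raw feature vectors cannot be compared directly.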

Acknowledgements

The authors would like to thank the editors and anonymous reviewers for their constructive comments and suggestions. This work was partially supported by the National Natural Science Foundation of China under Grant Nos. 61902228, 62041603 and 62176069, the Natural Science Basic Research Program of Shaanxi Province under Grant No. 2020JQ-422, the Natural Science Foundation of Guangdong Province under Grant No. 2019A1515011076, the Innovation Group of Guangdong Education Department under Grant Nos. 2020KCXTD014 and 2018KCXTD019, the 2019 Key Discipline project of Guangdong Province, the Natural Science Foundation of Jiangxi Province under Grant No. 20202BABL202036, the Fundamental Research Funds for the Central Universities under Grant Nos. GK202103083 and GK202105006, and the project of State Key Laboratory for Novel Software Technology under Grant No. KFKT2021B29.

Author information

Authors and Affiliations

School of Computer Science, Shaanxi Normal University, Xi’an, 710119, China
Jingwen Niu & Zhiqiang Li
School of Computer and Information Engineering, Xinxiang University, Xinxiang, 453003, China
Jingwen Niu
School of Computer Science, Wuhan University, Wuhan, 430072, China
Haowen Chen & Xiao-Yuan Jing
School of Computer and Big Data Science, Jiujiang University, Jiujiang, 332005, China
Xiwei Dong
School of Computer Science and Guangdong Provincial Key Laboratory on Petrochemical Equipment Fault Diagnosis, Guangdong University of Petrochemical Technology, Maoming, 525000, China
Xiao-Yuan Jing
State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, 210023, China
Xiao-Yuan Jing

Corresponding author

Correspondence to Zhiqiang Li.

Ethics declarations

Conflict of interest

The authors declare no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Niu, J., Li, Z., Chen, H. et al. Data sampling and kernel manifold discriminant alignment for mixed-project heterogeneous defect prediction. Software Qual J 30, 917–951 (2022). https://doi.org/10.1007/s11219-022-09588-z

Accepted: 17 February 2022
Published: 11 April 2022
Issue Date: December 2022
DOI: https://doi.org/10.1007/s11219-022-09588-z
