Transfer learning for cross-company software defect prediction

https://doi.org/10.1016/j.infsof.2011.09.007

Abstract

Context

Software defect prediction studies have usually built models using within-company data, but very few have focused on prediction models trained with cross-company data. Models built on within-company data are difficult to employ in practice because such local data repositories are often unavailable. Recently, transfer learning has attracted increasing attention for building classifiers in a target domain using data from a related source domain. It is very useful when the distributions of the training and test instances differ, but is it appropriate for cross-company software defect prediction?

Objective

In this paper, we consider the cross-company defect prediction scenario, where the source and target data are drawn from different companies. In order to harness cross-company data, we exploit transfer learning to build a fast and highly effective prediction model.

Method

Unlike prior work that selects training data similar to the test data, we propose a novel algorithm called Transfer Naive Bayes (TNB), which uses the information of all the proper features in the training data. Our solution estimates the distribution of the test data and transfers cross-company data information into the weights of the training data. The defect prediction model is then built on these weighted data.

Results

This article presents a theoretical analysis of the compared methods and reports experimental results on data sets from different organizations. The results indicate that TNB is more accurate in terms of AUC (the area under the receiver operating characteristic curve) and requires less runtime than the state-of-the-art methods.

Conclusion

We conclude that when there are too few local training data to train good classifiers, useful knowledge from differently-distributed training data at the feature level may help. We are optimistic that our transfer learning method can guide optimal resource allocation strategies, which may reduce software testing cost and increase the effectiveness of the software testing process.

Introduction

Predicting the quality of software modules is critical for high-assurance and mission-critical systems. Within-company defect prediction [1], [2], [3], [4], [5], [6], [7] has been well studied over the last three decades. In practice, however, local training data are rarely available, either because past local defective modules are expensive to label, or because the modules being developed belong to domains unfamiliar to the company. Fortunately, many public data repositories from different companies exist. Yet, to the best of our knowledge, very few studies have focused on prediction models trained with such cross-company data.

Cross-company defect prediction is not a traditional machine learning problem, because the training data and test data follow different distributions. To address this, Turhan et al. [8] use a Nearest Neighbor Filter (NN-filter) to select data similar to the test data from the source data as training data. They discard dissimilar data, which may still contain useful information for training. Later, Zimmermann et al. [9] use decision trees to help managers estimate precision, recall, and accuracy before attempting a prediction across projects. However, their method does not yield good results when applied across different projects. We consider this a critical transfer learning problem, as defined by Pan and Yang [10].
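The NN-filter idea above can be sketched briefly: for each test (target) instance, keep its k nearest cross-company instances by Euclidean distance over the metric features, and pool the union as the training set. The function name and the default k = 10 here are illustrative assumptions, not the authors' exact implementation.

```python
import numpy as np

def nn_filter(source_X, source_y, target_X, k=10):
    """Sketch of an NN-filter: for each target instance, select its k
    nearest source instances (Euclidean distance) and return the union
    of all selected instances as the filtered training set."""
    selected = set()
    for t in target_X:
        d = np.linalg.norm(source_X - t, axis=1)   # distances to every source row
        selected.update(np.argsort(d)[:k].tolist())  # indices of k nearest
    idx = sorted(selected)
    return source_X[idx], source_y[idx]
```

Note that dissimilar source instances are simply dropped, which is exactly the information loss the paper's TNB approach tries to avoid.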

Unlike these papers, we develop a novel transfer learning algorithm called Transfer Naive Bayes (TNB) for cross-company defect prediction. Instead of discarding some training samples, we exploit the information of all the cross-company data in the training step. By weighting the instances of the training data based on information from the target set, we build a weighted Naive Bayes classifier. Finally, we perform an analysis on publicly available project data sets from NASA and Turkish local software data sets [11]. Our experimental results show that TNB gives better performance on all the data sets when compared with the state-of-the-art methods.

The rest of this paper is organized as follows. Section 2 briefly reviews the background on transfer learning techniques and software defect prediction algorithms. Section 3 presents our transfer algorithm and analyzes its theoretical runtime cost. Section 4 describes the software defect data sets and the performance metrics used in this study, and reports the experimental results with discussion. Section 5 concludes the paper and outlines future work.

Section snippets

Transfer learning techniques

Transfer learning techniques allow the domains, tasks, and distributions of the training data and test data to differ, and have recently been applied successfully in many real-world applications. According to [10], transfer learning is defined as follows: Given a source domain DS with learning task TS, and a target domain DT with learning task TT, transfer learning aims to help improve the learning of the target predictive function in DT using the knowledge in DS and TS, where DS ≠ DT or TS ≠ TT.

Transfer learning for software defect prediction

In this section, we present our Transfer Naive Bayes (TNB) algorithm, based on Naive Bayes, and give a theoretical runtime cost analysis for it. The main idea of TNB is to assign weights to the training samples according to the feature-level similarities between the source and target data; a Naive Bayes classifier is then built on these weighted training samples.
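As a rough sketch of this weighting idea: each source instance can receive a similarity score s counting how many of its attribute values fall inside the [min, max] range of the corresponding test-set attribute, with instances scoring higher receiving larger weights. The particular weighting formula below, w = s / (k − s + 1)², is one plausible instantiation written for illustration, not a verbatim reproduction of the paper's scheme.

```python
import numpy as np

def tnb_weights(source_X, target_X):
    """Feature-level similarity weighting (illustrative): for each source
    instance, s counts how many attribute values lie within the target
    data's per-attribute [min, max] range; the weight grows with s as
    w = s / (k - s + 1)**2, where k is the number of attributes."""
    lo, hi = target_X.min(axis=0), target_X.max(axis=0)
    k = source_X.shape[1]                        # number of attributes
    inside = (source_X >= lo) & (source_X <= hi)  # per-attribute range test
    s = inside.sum(axis=1)                       # per-instance similarity score
    return s / (k - s + 1) ** 2

# These weights could then drive a weighted Naive Bayes, e.g. via scikit-learn:
# GaussianNB().fit(source_X, source_y, sample_weight=tnb_weights(source_X, target_X))
```

The key contrast with the NN-filter is that no source instance is discarded: dissimilar instances are merely down-weighted.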

Experiments

In this section, we evaluate the TNB algorithm empirically. We use the Naive Bayes classifier in Weka [35] for the CC method, and implement the NN-filter and TNB methods in the Weka environment. We focus on cross-company prediction of defect-prone software modules in this experiment. As we show later, TNB significantly improves prediction performance, in less time, over the sample-selection method when applied to defect data sets from different companies.
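AUC, the headline metric of the evaluation, can be computed directly from predicted scores via the rank-sum (Mann-Whitney) identity; the minimal implementation below is an illustration, not the paper's Weka setup.

```python
def auc(scores, labels):
    """AUC as the probability that a randomly chosen defective module
    (label 1) receives a higher score than a randomly chosen
    non-defective one (label 0), with ties counted as 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

An AUC of 0.5 corresponds to random guessing and 1.0 to perfect ranking, which is why AUC is a natural choice for the imbalanced defect data sets used here.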

Threats to validity

As with every empirical study, our results are subject to some threats to validity.

Conclusion and future work

In this paper, we addressed the issue of how to predict software defects using cross-company data. In our setting, labeled training data are available but have a different distribution from the unlabeled test data. We have developed a sample weighting algorithm based on Naive Bayes, called Transfer Naive Bayes (TNB).

The TNB algorithm applies the weighted Naive Bayes model by transferring information from the target data to the source data. First, it calculates each attribute's information of

Acknowledgements

This research was partially supported by the National High Technology Research and Development Program of China (No. 2007AA01Z443), the Research Fund for the Doctoral Program of Higher Education (No. 20070614008), and the Fundamental Research Funds for the Central Universities (No. ZYGX2009J066). We thank the anonymous reviewers for their helpful comments.

References (38)

  • T. Menzies, B. Turhan, A. Bener, G. Gay, B. Cukic, Y. Jiang, Implications of ceiling effects in defect predictors, in:...
  • B. Turhan et al.

    On the relative value of cross-company and within-company data for defect prediction

    Empirical Software Engineering

    (2009)
  • T. Zimmermann, N. Nagappan, H. Gall, E. Giger, Cross-project defect prediction: a large scale experiment on data vs....
  • S.J. Pan, Q. Yang, A survey on transfer learning, Technical Report HKUST-CS 08-08, Department of Computer Science and...
  • G. Boetticher, T. Menzies, T. Ostrand, The PROMISE Repository of Empirical Software Engineering Data, 2007...
  • Y. Shi, Z. Lan, W. Liu, W. Bi, Extending semi-supervised learning methods for inductive transfer learning, In: Ninth...
  • J. Huang, A. Smola, A. Gretton, K.M. Borgwardt, B. Scholkopf, Correcting sample selection bias by unlabeled data, in:...
  • S. Bickel, M. Bruckner, T. Scheffer, Discriminative learning for differing training and test distributions, in:...
  • M. Sugiyama, S. Nakajima, H. Kashima, P.V. Buenau, M. Kawanabe, Direct importance estimation with model selection and...